Friday, December 15, 2006

Varozhka: Creating an estimator (part 3)

Overview

The same results can be achieved with Notepad and csc.exe from the .NET Framework distribution, but I recommend using Visual Studio 2005. I'm using VS2005 Standard Edition, but I'm pretty sure any edition will work. If you don't have VS2005 installed, get a free copy of VS2005 Express Edition at http://msdn.microsoft.com/vstudio/express/.

Estimator creation

1. Run VS2005 and create a new Class Library project. Name the project MyEstimator and save it (say, to C:\Projects).

(Screenshot: New Project dialog)

2. Add references to the Varozhka.Processing.dll and Varozhka.TrainingData.dll assemblies. These assemblies are located in the directory where you extracted the Varozhka package (C:\Varozhka).

(Screenshot: Add Reference dialog)

3. Create a new class named TestEstimator and inherit it from the BaseEstimator class, an abstract class that places some constraints on its descendants. Implement the required constructor and override the GetRating() method; this is where you implement your algorithm. For now, let it return a constant value, say 3.0.

The TestEstimator class should look something like:

    using System;
    using Varozhka.Processing;
    using Varozhka.TrainingData;

    namespace MyEstimator
    {
        public class TestEstimator : BaseEstimator
        {
            public TestEstimator(NetflixData netflixData)
                : base(netflixData)
            {
            }

            public override float GetRating(int movieId,
                                            int customerId,
                                            DateTime date)
            {
                return 3.0f;
            }
        }
    }

Basically, the dumb estimator is ready, and you can already use it to generate predictions... but let's try to create something more interesting.

4. Let's create an estimator that returns the customer's average rating. We sum the ratings of all movies the customer has rated and divide the total by the number of ratings. Our estimator has a NetflixData property, which gives access to the Netflix data. In the current implementation it lets you get:

  • the movies rated by a customer (GetMoviesByCustomer() method)
  • the customer/rating pairs for a movie (GetPacksByMovie() method)
  • the customers who watched a movie (GetCustomersByMovie() method). This method is pretty slow in the current implementation, and I don't recommend using it.
  • the rating given to a movie by a particular customer (GetRating() method)

A straightforward implementation would be:

    public override float GetRating(int movieId,
                                    int customerId,
                                    DateTime date)
    {
        // get all movies rated by the customer
        short[] movies = NetflixData.GetMoviesByCustomer(customerId);

        // get sum of all ratings
        int sumRatings = 0;
        foreach (short movie in movies)
        {
            sumRatings += NetflixData.GetRating(movie, customerId);
        }

        // calculate average rating
        return (float)sumRatings / (float)movies.Length;
    }

This implementation has several drawbacks. For example:

  • it gives wrong results for the probe set, because the rating for the movie with the passed movieId should be left out of the average.
  • it recalculates the average on every call; it would be a good idea to cache the calculated values.

But the purpose of this sample is to show what the indexes can do, so I'm ignoring these issues.
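
For readers who do want to address the two drawbacks, here is a minimal sketch of one way to do it. It is not part of Varozhka: IRatingSource is a hypothetical interface that stands in for the two NetflixData calls used above (I'm assuming GetRating() returns an integral value, as the summing code suggests), CustomerAverage and InMemoryRatingSource are made-up names, and the fallback value for customers with no other ratings is my own addition. The sketch also assumes movie IDs fit in a short, as the short[] returned by GetMoviesByCustomer() implies.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the two NetflixData calls used in this post.
public interface IRatingSource
{
    short[] GetMoviesByCustomer(int customerId);
    int GetRating(short movieId, int customerId);
}

public class CustomerAverage
{
    // Cached per-customer data: the rated movies and the sum of their ratings.
    private class Entry
    {
        public short[] Movies;
        public int Sum;
    }

    private readonly IRatingSource source;
    private readonly Dictionary<int, Entry> cache = new Dictionary<int, Entry>();

    public CustomerAverage(IRatingSource source)
    {
        this.source = source;
    }

    // Average of the customer's ratings, leaving out the movie being predicted.
    public float GetAverageExcluding(int movieId, int customerId, float fallback)
    {
        Entry entry;
        if (!cache.TryGetValue(customerId, out entry))
        {
            // first request for this customer: sum the ratings once and remember them
            entry = new Entry();
            entry.Movies = source.GetMoviesByCustomer(customerId);
            foreach (short movie in entry.Movies)
            {
                entry.Sum += source.GetRating(movie, customerId);
            }
            cache[customerId] = entry;
        }

        int sum = entry.Sum;
        int count = entry.Movies.Length;

        // drop the target movie's rating if the customer has rated it
        short excluded = (short)movieId;
        if (Array.IndexOf(entry.Movies, excluded) >= 0)
        {
            sum -= source.GetRating(excluded, customerId);
            count--;
        }

        // nothing left to average: return the caller-supplied fallback
        return count > 0 ? (float)sum / count : fallback;
    }
}

// Small in-memory source for trying the sketch without the Netflix data.
public class InMemoryRatingSource : IRatingSource
{
    private readonly Dictionary<int, Dictionary<short, int>> ratings =
        new Dictionary<int, Dictionary<short, int>>();

    public void Add(int customerId, short movieId, int rating)
    {
        Dictionary<short, int> byMovie;
        if (!ratings.TryGetValue(customerId, out byMovie))
        {
            byMovie = new Dictionary<short, int>();
            ratings[customerId] = byMovie;
        }
        byMovie[movieId] = rating;
    }

    public short[] GetMoviesByCustomer(int customerId)
    {
        Dictionary<short, int> byMovie;
        if (!ratings.TryGetValue(customerId, out byMovie))
        {
            return new short[0];
        }
        short[] movies = new short[byMovie.Count];
        byMovie.Keys.CopyTo(movies, 0);
        return movies;
    }

    public int GetRating(short movieId, int customerId)
    {
        return ratings[customerId][movieId];
    }
}
```

Inside a real estimator, the override of GetRating() would presumably wrap the NetflixData property in an IRatingSource adapter and call GetAverageExcluding(movieId, customerId, 3.0f). Caching the sum together with the movie list keeps the exclusion step cheap, at the cost of holding one array per customer in memory.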

5. Compile the project in Release mode. Now we have a working estimator, and it's ready to use in the prediction generator ;)

Check the next post on how to generate predictions.

NOTE: the project is under development, so these instructions will become obsolete some day.
