Friday, December 15, 2006

Varozhka: Generation of a prediction set (part 4)

By now you should have the Netflix dataset, Varozhka, and a compiled test estimator on your machine. If you don't, check the previous posts.

Before you start processing, I recommend closing all unnecessary applications, because importing and processing are CPU- and memory-intensive tasks.

1. Run Varozhka.UI.exe. If this is the first run, it will ask you for settings:

Settings dialog

You should provide:

  • Directory with the Netflix dataset.
  • Directory where prediction sets should be generated.
  • Assembly with an estimator (if you followed the steps from the previous post, it should be in C:\Projects\MyEstimator\bin\Release).

2. Indexing starts automatically if the settings are valid:

Indexing the Netflix dataset

This is a long operation... you can grab a cup of coffee while it's importing. On my four-year-old P4 it takes around 40 minutes... a good place to optimize ;)

3. The main UI appears once indexing is complete. Here you can check the RMSE of the current estimator against the probe set:

RMSE check in progress...
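For reference, RMSE (root mean squared error) is the square root of the average squared difference between the predicted and the actual ratings. The sketch below only illustrates that formula; it is not Varozhka's code, and the RmseSketch class and its Compute method are hypothetical names.

```csharp
using System;

// Hypothetical helper, not part of Varozhka: illustrates the RMSE formula only.
public static class RmseSketch
{
    // predicted[i] is the estimator's guess for the i-th probe entry,
    // actual[i] is the rating the customer really gave.
    public static double Compute(float[] predicted, float[] actual)
    {
        if (predicted == null) throw new ArgumentNullException("predicted");
        if (actual == null) throw new ArgumentNullException("actual");
        if (predicted.Length != actual.Length)
            throw new ArgumentException("Arrays must have the same length.");
        if (predicted.Length == 0)
            throw new ArgumentException("At least one rating is required.");

        double sumSquaredErrors = 0.0;
        for (int i = 0; i < predicted.Length; i++)
        {
            double error = predicted[i] - actual[i];
            sumSquaredErrors += error * error;
        }

        // Mean of the squared errors, then the square root.
        return Math.Sqrt(sumSquaredErrors / predicted.Length);
    }
}
```

For example, a constant prediction of 3.0 against actual ratings 1, 3, 5, 3 has errors of 2, 0, -2, 0, so the RMSE is sqrt(8 / 4), about 1.414.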

And, if the results are good, generate a prediction set to submit:

Prediction set generation in progress...

When generation completes, it opens the submission page on the Netflix Prize site and an Explorer window showing the output directory (the one you specified in the settings):

Prediction set is ready to submit

output.txt.gz is the prediction file, and md5.txt contains the MD5 hash string.
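If you want to double-check the hash by hand, a minimal sketch could look like the one below. It assumes that md5.txt holds the hex-encoded MD5 of output.txt.gz and that lowercase hex is acceptable; neither is confirmed here. Md5Sketch and its methods are hypothetical names, not part of Varozhka.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper, not Varozhka's code: hex-encoded MD5 of a stream or file.
public static class Md5Sketch
{
    public static string HashToHex(Stream input)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(input);

            // Two lowercase hex digits per byte (assumed format).
            StringBuilder hex = new StringBuilder(hash.Length * 2);
            foreach (byte b in hash)
            {
                hex.Append(b.ToString("x2"));
            }
            return hex.ToString();
        }
    }

    public static string HashFileToHex(string path)
    {
        using (FileStream stream = File.OpenRead(path))
        {
            return HashToHex(stream);
        }
    }
}
```

Calling Md5Sketch.HashFileToHex on the generated output.txt.gz and comparing the result with the content of md5.txt (ignoring case and surrounding whitespace) should show whether the two files belong together.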

To submit the generated prediction set:

  1. Fill in your team info.
  2. Choose output.txt.gz in the Prediction File field.
  3. Paste the content of md5.txt into the MD5 Hash field.
  4. Hit the Submit button.

You should soon receive emails with the submission results ;)

Several things are not implemented in the current version:

  • processing cannot be stopped
  • the estimator cannot be reloaded, so you have to quit and restart the app if you change the estimator

 

 


Varozhka: Creating an estimator (part 3)

Overview

The same results can be achieved with Notepad and csc.exe from the .NET Framework distribution, but I recommend using Visual Studio 2005. I'm using VS2005 Standard Edition, but I'm pretty sure any edition will do. If you don't have VS2005 installed, get a free copy of VS2005 Express Edition at http://msdn.microsoft.com/vstudio/express/.

Estimator creation

1. Run VS2005 and create a new Class Library project. Name the project MyEstimator and save it (say, to C:\Projects).

New Project dialog

2. Add references to the Varozhka.Processing.dll and Varozhka.TrainingData.dll assemblies. These assemblies are located in the directory where you extracted the Varozhka package (C:\Varozhka).

Add Reference dialog

3. Create a new class named TestEstimator and inherit it from the BaseEstimator class, an abstract class that imposes some constraints on its descendants. Implement the required constructor and override the GetRating() method; this is where you implement your algorithm. For now, let it return a constant value, say 3.0.

The TestEstimator class should look something like:

    using System;
    using Varozhka.Processing;
    using Varozhka.TrainingData;

    namespace MyEstimator
    {
        public class TestEstimator : BaseEstimator
        {
            public TestEstimator(NetflixData netflixData)
                : base(netflixData)
            {
            }

            public override float GetRating(int movieId,
                                            int customerId,
                                            DateTime date)
            {
                return 3.0f;
            }
        }
    }

Basically, the dumb estimator is ready, and you could already use it for estimation... but let's try to create something more interesting.

4. Let's create an estimator that returns the customer's average rating. We sum up the ratings of all movies the customer has rated and divide the sum by the number of ratings. The estimator has a NetflixData property, which gives access to the Netflix data. In the current implementation it lets you get:

  • the movies rated by a customer (GetMoviesByCustomer() method)
  • the customer/rating pairs for a movie (GetPacksByMovie() method)
  • the customers who watched a movie (GetCustomersByMovie() method). This method is pretty slow in the current implementation, and I don't recommend using it.
  • the rating of a movie by a given customer (GetRating() method).

A straightforward implementation would be:

    public override float GetRating(int movieId,
                                    int customerId,
                                    DateTime date)
    {
        // get all movies watched by the customer
        short[] movies = NetflixData.GetMoviesByCustomer(customerId);

        // get sum of all ratings
        int sumRatings = 0;
        foreach (short movie in movies)
        {
            sumRatings += NetflixData.GetRating(movie, customerId);
        }

        // calculate average rating
        return (float)sumRatings / (float)movies.Length;
    }

This implementation has several drawbacks. For example:

  • it gives skewed results on the probe set, because the probe ratings are part of the training data; the rating for the movie passed in as movieId should be left out of the average.
  • it's a good idea to cache the calculated ratings instead of recomputing the average on every call.

But the purpose of this sample is to show what the indexes can do, so I'm ignoring these issues.
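For the curious, here is a rough sketch of how both drawbacks could be addressed. To stay self-contained it does not use Varozhka at all: InMemoryRatings is a hypothetical stand-in for the data layer (its TryGetRating method is an assumption, not a NetflixData method), and CachedCustomerAverage is a hypothetical class, not a BaseEstimator descendant. The fallback value returned for customers with no usable ratings is also an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the data layer; this is NOT Varozhka's NetflixData.
public class InMemoryRatings
{
    // customerId -> (movieId -> rating)
    private readonly Dictionary<int, Dictionary<short, byte>> byCustomer =
        new Dictionary<int, Dictionary<short, byte>>();

    public void Add(int customerId, short movieId, byte rating)
    {
        Dictionary<short, byte> ratings;
        if (!byCustomer.TryGetValue(customerId, out ratings))
        {
            ratings = new Dictionary<short, byte>();
            byCustomer[customerId] = ratings;
        }
        ratings[movieId] = rating;
    }

    public short[] GetMoviesByCustomer(int customerId)
    {
        Dictionary<short, byte> ratings;
        if (!byCustomer.TryGetValue(customerId, out ratings))
        {
            return new short[0];
        }
        short[] movies = new short[ratings.Count];
        ratings.Keys.CopyTo(movies, 0);
        return movies;
    }

    // Returns false when the customer has no rating for the movie.
    public bool TryGetRating(short movieId, int customerId, out byte rating)
    {
        rating = 0;
        Dictionary<short, byte> ratings;
        return byCustomer.TryGetValue(customerId, out ratings)
            && ratings.TryGetValue(movieId, out rating);
    }
}

// Hypothetical estimator: cached per-customer totals, minus the target movie.
public class CachedCustomerAverage
{
    private struct Totals
    {
        public int Sum;
        public int Count;
    }

    private readonly InMemoryRatings data;
    private readonly float fallback;
    private readonly Dictionary<int, Totals> cache = new Dictionary<int, Totals>();

    public CachedCustomerAverage(InMemoryRatings data, float fallback)
    {
        this.data = data;
        this.fallback = fallback;
    }

    public float GetRating(int movieId, int customerId)
    {
        Totals totals;
        if (!cache.TryGetValue(customerId, out totals))
        {
            // First request for this customer: sum the ratings once and remember them.
            totals = new Totals();
            foreach (short movie in data.GetMoviesByCustomer(customerId))
            {
                byte rating;
                if (data.TryGetRating(movie, customerId, out rating))
                {
                    totals.Sum += rating;
                    totals.Count++;
                }
            }
            cache[customerId] = totals;
        }

        int sum = totals.Sum;
        int count = totals.Count;

        // Leave out the customer's own rating for the movie being predicted, if any.
        byte own;
        if (data.TryGetRating((short)movieId, customerId, out own))
        {
            sum -= own;
            count--;
        }

        return count > 0 ? (float)sum / count : fallback;
    }
}
```

With ratings of 5, 3 and 1 for movies 10, 20 and 30, predicting movie 10 averages only 3 and 1 and returns 2.0, while predicting an unrated movie returns the full average of 3.0. Note that the cache assumes the ratings do not change after the first lookup.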

5. Compile the project in Release mode. Now we have a working estimator, and it's ready to be used in the prediction generator ;)

Check the next post on how to generate predictions.

NOTE: the project is under development, so this walkthrough will become obsolete some day.


Varozhka: Installation (part 2)

Requirements

  • At least 1 GB of RAM. Processing is very memory-intensive because the Netflix dataset is large.
  • ~3 GB of free disk space. Most of it is used by the Netflix dataset.
  • Windows 2000/XP/2003/Vista (tested on XP only; please let me know if you have tried it on any other system).
  • .NET Framework 2.0. I guess everything except the UI could be ported to Mono, but I haven't tried that.

Preparations

First, download the archive with the Netflix dataset and extract it to some directory (say, c:\Netflix). Then extract the archive with the movies (training_set.tar) to the training_set subdirectory (c:\Netflix\training_set); the indexer expects this subdirectory.
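To give an idea of what the indexer has to read: each file in training_set covers one movie, starting with a "MovieID:" line followed by "CustomerID,Rating,Date" lines with dates in YYYY-MM-DD form. The sketch below parses that layout; it is not Varozhka's indexer, and RatingRecord and TrainingFileSketch are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

// Hypothetical record type, not part of Varozhka.
public struct RatingRecord
{
    public int MovieId;
    public int CustomerId;
    public byte Rating;
    public DateTime Date;
}

// Hypothetical parser for one training_set file; not Varozhka's indexer.
public static class TrainingFileSketch
{
    public static List<RatingRecord> Parse(TextReader reader)
    {
        List<RatingRecord> records = new List<RatingRecord>();
        int movieId = -1;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            line = line.Trim();
            if (line.Length == 0)
            {
                continue;
            }

            // Header line such as "1:" names the movie for the lines that follow.
            if (line.EndsWith(":"))
            {
                movieId = int.Parse(line.Substring(0, line.Length - 1),
                                    CultureInfo.InvariantCulture);
                continue;
            }

            if (movieId < 0)
            {
                throw new FormatException("Rating line found before a movie header.");
            }

            string[] parts = line.Split(',');
            if (parts.Length != 3)
            {
                throw new FormatException("Expected CustomerID,Rating,Date: " + line);
            }

            RatingRecord record = new RatingRecord();
            record.MovieId = movieId;
            record.CustomerId = int.Parse(parts[0], CultureInfo.InvariantCulture);
            record.Rating = byte.Parse(parts[1], CultureInfo.InvariantCulture);
            record.Date = DateTime.ParseExact(parts[2], "yyyy-MM-dd",
                                              CultureInfo.InvariantCulture);
            records.Add(record);
        }
        return records;
    }
}
```

Passing a StreamReader opened on a single file from c:\Netflix\training_set to TrainingFileSketch.Parse returns all ratings for that movie.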

Installation

  1. Download the latest version of Varozhka from SF.net.
  2. Extract it to some directory (say, c:\Varozhka).

Now it's time to implement an estimator with your algorithm. The Samples subdirectory (c:\Varozhka\Samples) contains two simple sample estimators. You can read the code or follow the walkthrough in the next post.


Varozhka: Introduction (part 1)

As you may know, Netflix organized a competition for systems predicting user ratings for movies.

I'm sure a lot of bright people have ideas on how to improve those predictions but don't have the time to spend on it.

This project is a framework that automates most of the dirty work with the dataset, so you can concentrate on the prediction algorithm ;)

Current features:

  • No additional DB engine is required. All indexes are loaded in memory.
  • Abstraction layer to play with the data (this is where you plug in your estimator).
  • Data access layer.
  • Easy way to check RMSE against the probe set.
  • Generation of submission dataset.

So, basically, you can download the Netflix dataset, extract it to a directory, start a wizard (which does all the import tasks), implement your own rating estimator, and use the wizard to submit the results to Netflix.

The project is named Varozhka (the Belarusian word for "fortune-teller"). It is hosted at Google Code and SourceForge.net.

This is an introductory post about the project. More details later...

NOTE: The project is under development, and most of the code is not optimized in any way.

 
