Goldfishy

Reputation: 53

Evaluation of collaborative filtering algo using test set

With item-based collaborative filtering, we utilise the item ratings of users similar to a given user to generate recommendations. Research often suggests evaluating the algorithm with a hold-out test set, e.g. 20% of the data held out for testing and 80% used for training. However, what if all the ratings of a certain item end up in the hold-out set? Our training data will no longer contain that item, so it will never be recommended.

E.g. 5 users each view 10 films, one of which is 'Titanic'. We randomly hold out a test set of 20% of the data per user = 2 films/user. What if 'Titanic' ends up in the test set for every user? It will never be recommended.
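
A quick way to see how real this risk is would be to simulate the split and check which items disappear from the training data. A minimal sketch, with made-up user and film names matching the example above:

```python
import random
from collections import defaultdict

# Illustrative ratings: 5 users, each with 10 films, all including 'Titanic'.
ratings = {f"user{u}": {f"film{i}" for i in range(9)} | {"Titanic"} for u in range(5)}

def per_user_holdout(ratings, test_fraction=0.2, seed=42):
    """Randomly hold out a fraction of each user's items as a test set."""
    rng = random.Random(seed)
    train, test = defaultdict(set), defaultdict(set)
    for user, items in ratings.items():
        items = list(items)
        rng.shuffle(items)
        n_test = int(len(items) * test_fraction)   # 2 films per user here
        test[user] = set(items[:n_test])
        train[user] = set(items[n_test:])
    return train, test

train, test = per_user_holdout(ratings)

# The failure mode described above: items with no remaining training interactions.
train_items = set().union(*train.values())
all_items = set().union(*ratings.values())
print("Items absent from training data:", (all_items - train_items) or "none")
```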

Upvotes: 4

Views: 2625

Answers (2)

pferrel

Reputation: 5702

The first answer is that this effect will be insignificant if the performance metric averages correctly. For this I always use MAP@k, mean average precision at k. This only measures the precision of your recommendations, but it does so in a way that keeps the averages valid even when some recommendations are missing, unless you have too little data.
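
For reference, a minimal sketch of MAP@k as it is commonly defined, assuming the recommendations are ranked lists and the held-out items serve as the relevance judgments; users with nothing held out are simply skipped, which is what keeps the average valid:

```python
def average_precision_at_k(recommended, relevant, k=10):
    """AP@k for one user: precision accumulated at the ranks of relevant hits."""
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(recs_by_user, heldout_by_user, k=10):
    """Mean AP@k over users that actually have held-out items to score against."""
    scored = [average_precision_at_k(recs_by_user.get(u, []), rel, k)
              for u, rel in heldout_by_user.items() if rel]
    return sum(scored) / len(scored) if scored else 0.0
```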

As @Bartłomiej Twardowski says, you can also do something like a k-fold test, which runs the evaluation on different splits and averages the results. This is less prone to small-dataset issues, and you can still use MAP@k as your metric, since k-fold only addresses how the data is split.
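
A per-user k-fold version of that evaluation could look roughly like this; train_and_recommend is a hypothetical stand-in for whatever recommender you are tuning, and map_at_k is the sketch above:

```python
import random

def k_fold_map(ratings, train_and_recommend, k_folds=5, k=10, seed=0):
    """Average MAP@k over k folds of each user's interactions.

    `train_and_recommend(train) -> {user: ranked item list}` is a placeholder
    for the recommender under evaluation (hypothetical here)."""
    rng = random.Random(seed)
    folds = {}
    for u, items in ratings.items():
        items = list(items)
        rng.shuffle(items)
        folds[u] = [items[i::k_folds] for i in range(k_folds)]

    scores = []
    for f in range(k_folds):
        train = {u: [x for i, part in enumerate(folds[u]) if i != f for x in part]
                 for u in ratings}
        test = {u: set(folds[u][f]) for u in ratings}
        recs = train_and_recommend(train)
        scores.append(map_at_k(recs, test, k))   # map_at_k from the sketch above
    return sum(scores) / len(scores)
```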

What we do is use MAP@k and split on a date to get 80% of older users in the training split and 20% of the newer users in the probe/test split. This mimics somewhat better how the real-world recommender will work, since there are often new users that come in after your model is built.
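
One way such a date-based user split might be coded, assuming the raw interactions are (user, item, timestamp) tuples:

```python
def split_users_by_date(interactions, test_fraction=0.2):
    """Put the oldest 80% of users in training and the newest 20% in the probe set."""
    first_seen = {}
    for user, _, ts in interactions:
        first_seen[user] = min(ts, first_seen.get(user, ts))
    users_by_age = sorted(first_seen, key=first_seen.get)   # oldest users first
    cutoff = int(len(users_by_age) * (1 - test_fraction))
    train_users = set(users_by_age[:cutoff])
    train = [r for r in interactions if r[0] in train_users]
    test = [r for r in interactions if r[0] not in train_users]
    return train, test
```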

BTW, don't forget that recommender "spread" is related to lift in conversions too, so recall is important. As a cheap and not very rigorous way to get at recall, we look at how many people in the hold-out set get recommendations. If you are comparing one tuning to another, recall corresponds to how many people can get recommendations, but you have to use exactly the same split in both cases.
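
That cheap recall proxy amounts to little more than the following, assuming recs_by_user maps each user to their recommendation list:

```python
def coverage_of_holdout_users(recs_by_user, heldout_users):
    """Cheap recall proxy: fraction of held-out users who get any recommendations."""
    served = sum(1 for u in heldout_users if recs_by_user.get(u))
    return served / len(heldout_users) if heldout_users else 0.0
```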

BTW2: note that when you use ratings of users similar to a given user, that is user-based collaborative filtering. When you find items similar to some example item in terms of who rated it highly, you are doing item-based. The difference is whether the item or the user is the query.
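
The distinction in miniature, using cosine similarity on a toy ratings matrix (a sketch of the idea, not either algorithm in full):

```python
import numpy as np

# Toy ratings matrix: rows = users, columns = items (values are illustrative).
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 0, 5, 4]], dtype=float)

def cosine_sim(M):
    """Pairwise cosine similarity between the rows of M."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    N = M / norms
    return N @ N.T

user_user = cosine_sim(R)     # user-based CF: the query is a user
item_item = cosine_sim(R.T)   # item-based CF: the query is an item
```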

One last plug for a new algorithm we use. To do both item- and user-based recs (as well as item-set recs like shopping-cart recs) we use CCO (Correlated Cross-Occurrence) to take advantage of all user actions as input. In a blog post about this we found a 20%+ increase in MAP@k for a dataset collected from a movie review web site when we used user "likes" as well as "dislikes" to predict likes. The algorithm is implemented in Apache Mahout and a complete turnkey recommender is here.

Upvotes: 1

Bartłomiej Twardowski

Reputation: 640

Evaluation methodology depends on the use case and the data type, so in some situations evaluating with a randomized 80/20 split is not enough, e.g. when time plays an important role, as in session-based recommendations.

Assuming this use case can be evaluated in such a manner, try not to base the evaluation on a single random train/test split, but go for N-fold cross-validation; in this case, 5-fold cross-validation with hold-out. The evaluation outcome will be the aggregated result from all folds. Going further, this experiment can be repeated a few times in order to get the final outcome.
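
The aggregation step is the simple part; a minimal sketch, where run_one_split is a hypothetical callable that performs one full 5-fold cross-validation with a given random seed and returns its averaged metric (e.g. MAP@k):

```python
import statistics

def repeated_cv(run_one_split, n_repeats=3):
    """Repeat a cross-validated evaluation with different seeds and aggregate."""
    scores = [run_one_split(seed) for seed in range(n_repeats)]
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return statistics.mean(scores), spread
```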

Check out these two projects:

Both can be useful for you, at least when looking for a proper evaluation methodology.

Upvotes: 3
