RandomTask

Reputation: 509

Mahout recommender returns no results for a user

I'm curious why in the example below the Mahout recommender isn't returning a recommendation for user 1.

My input file is below. I added blank lines to enhance readability. This file will need the blank lines removed before it's run through Mahout.

The columns in this file are:

User ID | item number | item rating

1 101 0
1 102 0
1 103 5
1 104 0

2 101 4
2 102 5
2 103 4
2 104 0

3 101 0
3 102 5
3 103 5
3 104 3

You'll note that item 103 is the only common item that all 3 users rated.

I ran: hadoop jar C:\hdp\mahout-0.9.0.2.1.3.0-1981\core\target\mahout-core-0.9.0.2.1.3.0-1981-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -s SIMILARITY_COOCCURRENCE --input small_data_set.txt --output small_data_set_output

The Mahout recommendation output file shows:

2 [104:4.5]
3 [101:5.0]

Which I believe means:

Is this correct?

Why isn't user 1 included in the recommendation output file? User 1 could have received a recommendation for Item 102 because user 2 and user 3 rated it. Is the data set too small?

Thanks in advance.

Upvotes: 1

Views: 763

Answers (1)

pferrel

Reputation: 5702

There may be several mistakes in your data and setup. The first two listed here will cause undefined behavior:

  • IDs must be contiguous non-negative integers starting at 0, so you need to map your IDs above somehow. Your-user-ID = 1 will become Mahout-user-ID = 0. The same goes for items: your-item-ID = 101 will become Mahout-item-ID = 0.
  • If a 0 means that the user has expressed no preference, omit those values from the input altogether by leaving out the lines entirely. This makes the preference "undefined" in a sense.
  • Always use SIMILARITY_LOGLIKELIHOOD. It is widely measured as doing significantly better than the other methods, unless you are trying to predict ratings, in which case use cosine.
  • If you use LLR (log-likelihood ratio) similarity, you should omit the preference values, since they will be ignored.
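As a rough illustration of the first, second, and fourth points, here is a minimal Python sketch of one way to prepare the file. The function name `prepare_preferences`, the whitespace-separated input, and the first-seen ordering of the new IDs are all assumptions made for this example. Nothing here is part of Mahout itself.

```python
def prepare_preferences(lines):
    """Turn 'user item rating' lines into 0-based (user, item) pairs.

    Hypothetical helper, not a Mahout API. It skips blank lines, drops
    zero ratings, discards the rating value, and remaps the original IDs
    to contiguous integers starting at 0 in first-seen order.
    """
    user_ids = {}  # original user ID -> contiguous index
    item_ids = {}  # original item ID -> contiguous index
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # blank or malformed line
        user, item, rating = fields
        if float(rating) == 0:
            continue  # 0 means "no preference expressed", so omit the line
        u = user_ids.setdefault(user, len(user_ids))
        i = item_ids.setdefault(item, len(item_ids))
        pairs.append((u, i))  # the rating value itself is dropped
    return pairs, user_ids, item_ids


if __name__ == "__main__":
    sample = ["1 101 0", "1 103 5", "", "2 101 4", "2 103 4"]
    pairs, users, items = prepare_preferences(sample)
    for u, i in pairs:
        print(u, i)
```

Keep the two returned dictionaries so you can translate the recommended IDs back to your own IDs afterwards.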

Preference values have very few uses unless you are trying to predict a user's rating for an item. The preference weights are useless in determining recommendation ranking, which is the typical thing to optimize. If you want to recommend the right things in the right order, toss the values and use LLR.
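To make it concrete why LLR has no use for the rating values, here is a small sketch of the log-likelihood ratio (the G² statistic) on a 2x2 table of counts. It only sees how many users interacted with both items, one of them, or neither. The function names are made up for this example, and Mahout's own implementation may differ in detail.

```python
import math


def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized entropy of a set of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """LLR for two items A and B.

    k11: users with both A and B     k12: users with A but not B
    k21: users with B but not A      k22: users with neither
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


if __name__ == "__main__":
    print(log_likelihood_ratio(1, 1, 1, 1))    # independent items
    print(log_likelihood_ratio(10, 0, 0, 10))  # strongly associated items
```

The score is close to zero when the two items occur together no more often than chance would predict, and it grows as the co-occurrence becomes more surprising. The inputs are counts of users only, so a rating of 3 and a rating of 5 contribute the same thing.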

The other thing people sometimes do with values is use them to encode a weight of preference, so that 1 = a view of a product page and 5 = a product purchase. This will not work! I tried this with a large ecommerce dataset and found the recommendations were worse when adding in product views, even though there was 100 times more data. They are fundamentally different user actions with different user intent, and so they can't be mixed in this way.

If you really do want to mix different actions, use the new multimodal recommender based on Mahout, Spark, and Solr, which is described on the Mahout site. It allows cross-cooccurrence indicator calculations, so you can use user location, likes and dislikes, views, and purchases. Virtually the entire user clickstream can be used, but only by using cross-cooccurrence to correlate each action with the canonical "best" action, the one you want to recommend.
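To show roughly what "correlating one action to the best action" means, here is a toy sketch that counts how often an item a user purchased (the primary action) appears alongside an item the same user viewed (a secondary action). The data, the names, and the use of raw counts are all assumptions for illustration. This is not how the Mahout code is written, and a real system would presumably score these counts with LLR rather than use them directly.

```python
from collections import Counter
from itertools import product


def cross_cooccurrence(primary, secondary):
    """Count (primary_item, secondary_item) pairs that share a user.

    primary, secondary: dicts mapping user -> set of items.
    Passing the same dict twice gives ordinary co-occurrence counts
    (including each item paired with itself).
    """
    counts = Counter()
    for user, primary_items in primary.items():
        secondary_items = secondary.get(user, set())
        for pair in product(primary_items, secondary_items):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical clickstream: purchases are the action we want to recommend.
    purchases = {"u1": {"A"}, "u2": {"A", "B"}, "u3": {"B"}}
    views = {"u1": {"A", "C"}, "u2": {"C"}, "u3": {"D"}}

    for pair, n in sorted(cross_cooccurrence(purchases, views).items()):
        print(pair, n)
```

The views are never added to the purchases or given a weight. They stay in their own matrix and are only ever compared against the purchase data.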

Upvotes: 2
