Use Cosine Similarity with Binary Data - Mahout

Question

I have a boolean/binary dataset in which a (customer, product id) pair is present when the customer actually bought the product and absent when the customer did not buy it. The dataset is represented like this:

[image: Dataset]

I have tried several approaches, such as GenericBooleanPrefUserBasedRecommender with the TanimotoCoefficient or LogLikelihood similarities. I have also tried GenericUserBasedRecommender with the Uncentered Cosine Similarity, which gave me the highest precision and recall: 100% and 60% respectively.

I am not sure whether it makes sense to use the Uncentered Cosine Similarity in this situation, or whether this is the wrong logic. What does the Uncentered Cosine Similarity do with such a dataset?

Any ideas would be really appreciated.

Thank you.

pferrel · Accepted Answer

100% precision is impossible so something is wrong. All the similarity metrics work fine with boolean data. Remember the space is of very high dimensionality.
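As a rough illustration of how two of these metrics behave on boolean data, here is a minimal Python sketch using the textbook definitions over the full item space. With all preference values equal to 1, uncentered cosine reduces to |A ∩ B| / sqrt(|A| · |B|) and Tanimoto to |A ∩ B| / |A ∪ B|. The function names and the sample purchase sets are made up for the example, and this is not Mahout's code; a particular Mahout class may compute its value differently.

```python
import math

def cosine_binary(a, b):
    """Uncentered cosine for two sets of purchased item ids (every value is 1)."""
    if not a or not b:
        return 0.0
    # Dot product = size of the overlap; each norm = sqrt(number of purchases).
    return len(a & b) / math.sqrt(len(a) * len(b))

def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient for two sets of purchased item ids."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

# Hypothetical purchase histories, for illustration only.
alice = {101, 102, 103}
bob = {102, 103, 104, 105}

print(cosine_binary(alice, bob))
print(tanimoto(alice, bob))
```

Both values depend only on the overlap and the basket sizes, which is why boolean data poses no problem for either metric.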

Your sample data has only two items (BTW, ids should be 0-based for the old Hadoop version of Mahout), so the dataset as shown is not going to give valid precision scores.
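To see why a two-item catalogue gives degenerate scores, here is a small Python sketch of precision and recall at k. It is a simplified stand-in, not Mahout's evaluator; the function name and the sample ids are assumptions for illustration.

```python
def precision_recall_at_k(recommended, held_out, k):
    """Precision and recall of the top-k recommendations against held-out purchases."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in held_out)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(held_out) if held_out else 0.0
    return precision, recall

# Two-item catalogue: the user bought item 1 and item 2 is held out.
# The only item left to recommend is item 2, so the score is perfect by construction.
print(precision_recall_at_k([2], {2}, k=1))

# A larger hypothetical catalogue, where a miss is possible.
print(precision_recall_at_k([5, 9, 2], {5, 7}, k=2))
```

In the first call the recommender cannot be wrong, so a perfect score says nothing about the similarity metric.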

I've done this with large e-commerce datasets, and log-likelihood considerably outperforms the other metrics on boolean data.
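For reference, here is a minimal Python sketch of the log-likelihood ratio (Dunning's G² statistic) on a 2x2 contingency table built from two users' purchase sets. The function names, the sample sets and `n_items` are assumptions for illustration; this is an independent sketch of the statistic, not Mahout's implementation.

```python
import math

def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 table; 0 means the two events look independent."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))

def llr_for_users(a, b, n_items):
    """Build the table from two purchase sets and the catalogue size (assumed known)."""
    k11 = len(a & b)             # bought by both
    k12 = len(a - b)             # bought only by a
    k21 = len(b - a)             # bought only by b
    k22 = n_items - len(a | b)   # bought by neither
    return llr(k11, k12, k21, k22)

# Hypothetical purchase histories in a 1000-item catalogue.
alice = {101, 102, 103}
bob = {102, 103, 104, 105}
print(llr_for_users(alice, bob, n_items=1000))
```

Unlike cosine or Tanimoto, the score takes the catalogue size into account, so an overlap that would be expected by chance scores close to zero.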

BTW, Mahout has moved on from Hadoop to Spark, and our only metric is LLR (log-likelihood ratio). A full Universal Recommender with event store and prediction server, based on Mahout-Samsara, is implemented here: http://templates.prediction.io/PredictionIO/template-scala-parallel-universal-recommendation Slides describing it are here: http://www.slideshare.net/pferrel/unified-recommender-39986309
