Mario

Reputation: 573

How to use sklearn's Matrix factorization to predict new users' recommendation scores

I'm trying to apply sklearn.decomposition.NMF to a matrix R of user ratings in order to predict ratings for items that users have not yet seen.

The matrix's rows are users, its columns are items, and its values are scores; a score of 0 means that the user has not rated the item yet.

With the code below I have only managed to get two matrices that, when multiplied together, give the original matrix back.

import numpy

R = numpy.array([
     [5,3,0,1],
     [4,0,0,1],
     [1,1,0,5],
     [1,0,0,4],
     [0,1,5,4],
    ])

from sklearn.decomposition import NMF
model = NMF(n_components=4)

A = model.fit_transform(R)
B = model.components_

n = numpy.dot(A, B)
print(n)

The problem is that the model does not put predicted scores in place of the 0's; instead it recreates the matrix as it was.

How do I get the model to predict user scores in place of my original matrix's zeros?

Upvotes: 3

Views: 4221

Answers (3)

julianhatwell

Reputation: 1274

pip install scikit-surprise

The docs and repo are here: https://github.com/NicolasHug/Surprise

Upvotes: 0

Sandipan Dey

Reputation: 23129

sklearn's implementation of NMF does not seem to support missing values (NaNs; here the 0 values represent unknown ratings, e.g. for new users); refer to this issue. However, we can use Surprise's NMF implementation, as shown in the following code:

import numpy as np
import pandas as pd
from surprise import NMF, Dataset, Reader

R = np.array([
     [5,3,0,1],
     [4,0,0,1],
     [1,1,0,5],
     [1,0,0,4],
     [0,1,5,4],
    ], dtype=float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent

R[R==0] = np.nan
print(R)

# [[ 5.  3. nan  1.]
#  [ 4. nan nan  1.]
#  [ 1.  1. nan  5.]
#  [ 1. nan nan  4.]
#  [nan  1.  5.  4.]]

df = pd.DataFrame(data=R, index=range(R.shape[0]), columns=range(R.shape[1]))
df = pd.melt(df.reset_index(), id_vars='index', var_name='items', value_name='ratings').dropna(axis=0)
reader = Reader(rating_scale=(0, 5))
data = Dataset.load_from_df(df[['index', 'items', 'ratings']], reader)

k = 2
algo = NMF(n_factors=k) 
trainset = data.build_full_trainset() 
algo.fit(trainset)
predictions = algo.test(trainset.build_testset()) # predict the known ratings
R_hat = np.zeros_like(R)
for uid, iid, true_r, est, _ in predictions:
    R_hat[uid, iid] = est
predictions = algo.test(trainset.build_anti_testset()) # predict the unknown ratings
for uid, iid, true_r, est, _ in predictions:
    R_hat[uid, iid] = est
print(R_hat)

# [[4.40762528 2.62138084 3.48176319 0.91649316]
# [3.52973408 2.10913555 2.95701406 0.89922637]
# [0.94977826 0.81254138 4.98449755 4.34497549]
# [0.89442186 0.73041578 4.09958967 3.50951819]
# [1.33811051 0.99007556 4.37795636 3.53113236]]

The NMF implementation follows the [NMF:2014] paper, as described here and shown in the image below:

[image from the original answer]

Note that the optimization here uses the known ratings only. As a result, the predicted values for the known ratings are close to the true ratings, while the predicted values for the unknown ratings are, as expected, not in general close to 0.
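To make the "known ratings only" idea concrete, here is a minimal NumPy sketch of a masked factorization. It is not Surprise's algorithm: it swaps in weighted multiplicative updates, and the names masked_nmf and masked_rmse, the seed, and the iteration count are all assumptions chosen for illustration.

```python
import numpy as np

def masked_nmf(R, mask, k=2, n_iter=2000, eps=1e-9, seed=0):
    """Factor R ~ W @ H using only the entries where mask is True.

    Illustrative helper (not from sklearn or Surprise): weighted
    multiplicative updates for the masked squared error.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    W = rng.random((n_users, k)) + 0.1   # strictly positive start
    H = rng.random((k, n_items)) + 0.1
    M = mask.astype(float)
    R0 = np.where(mask, R, 0.0)          # unknown entries contribute nothing
    for _ in range(n_iter):
        W *= (R0 @ H.T) / ((M * (W @ H)) @ H.T + eps)
        H *= (W.T @ R0) / (W.T @ (M * (W @ H)) + eps)
    return W, H

def masked_rmse(R, R_hat, mask):
    """RMSE computed over the known entries only."""
    return float(np.sqrt(np.mean((R[mask] - R_hat[mask]) ** 2)))

R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)
mask = R > 0                             # assumption: 0 means "not rated"

W, H = masked_nmf(R, mask, k=2)
R_hat = W @ H                            # every cell now holds an estimate
print(np.round(R_hat, 2))
print("RMSE on known ratings:", masked_rmse(R, R_hat, mask))
```

Because the zeros never enter the loss, the cells that were 0 in R are filled by whatever the two factors imply, rather than being pulled toward 0.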

Again, as usual, we can find the number of factors k using cross-validation.

Upvotes: 1

Rafael Valero

Reputation: 2816

That is what is supposed to happen: with n_components=4 the factorization has as many components as R has columns, so it can reproduce R almost exactly.

However, in most cases the number of components will be much smaller than the number of products and/or customers.

So, for instance, consider 2 components:

import numpy as np
from sklearn.decomposition import NMF

# R is the ratings matrix defined in the question
model = NMF(n_components=2)
A = model.fit_transform(R)
B = model.components_
R_estimated = np.dot(A, B)
print(np.sum(R - R_estimated))
# -1.678873127048393
print(R_estimated)
# [[5.2558264  1.99313836 0.         1.45512772]
#  [3.50429478 1.32891458 0.         0.9701988 ]
#  [1.31294288 0.94415991 1.94956896 3.94609389]
#  [0.98129195 0.72179987 1.52759811 3.0788454 ]
#  [0.         0.65008935 2.84003662 5.21894555]]

You can see that in this case many of the former zeros are now nonzero estimates you could use as predictions. For a bit of context, see https://en.wikipedia.org/wiki/Matrix_factorization_(recommender_systems).

How to select n_components?

I think the question above is answered, but for completeness the full procedure could look something like the steps below.

For that we need to know which values in R are real ratings, since those are the ones we want to focus on predicting.

In many cases the 0s in R are exactly those new cases / scenarios. It is common to fill them in R with the averages for products or customers and then compute the decomposition in order to select the ideal n_components. For the selection you can use one or more criteria that measure performance on a test sample:

  1. Create R_with_Averages
  2. Model selection:
     2.1) Split R_with_Averages into test and training
     2.2) Compare different n_components (from 1 to an arbitrary number) using a metric (in which you only consider real evaluations in R)
     2.3) Select the best model --> best n_components
  3. Predict with the best model.
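The steps above could be sketched roughly as follows. This is only one possible reading of the procedure: the helper fill_with_item_means, the three-rating hold-out, the range of n_components, and the seeds are assumptions made for illustration, and a hold-out this small on a 5x4 matrix is far too noisy for a real decision.

```python
import numpy as np
from sklearn.decomposition import NMF

def fill_with_item_means(R, known):
    """Replace unknown entries with the mean known rating of their column
    (illustrative helper; falls back to the global mean for empty columns)."""
    global_mean = R[known].mean()
    counts = known.sum(axis=0)
    sums = np.where(known, R, 0.0).sum(axis=0)
    item_means = np.where(counts > 0, sums / np.maximum(counts, 1), global_mean)
    return np.where(known, R, item_means)

R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)
known = R > 0                                  # real evaluations only

# 2.1) Hold out a few real ratings as the test sample
rng = np.random.default_rng(0)
rows, cols = np.nonzero(known)
held = rng.choice(len(rows), size=3, replace=False)
test_mask = np.zeros_like(known)
test_mask[rows[held], cols[held]] = True
train_mask = known & ~test_mask

# 1) Build R_with_Averages from the training ratings only
R_train = fill_with_item_means(R, train_mask)

# 2.2) Compare n_components using RMSE on the held-out real ratings
scores = {}
for k in range(1, 4):
    model = NMF(n_components=k, init="nndsvda", max_iter=2000, random_state=0)
    W = model.fit_transform(R_train)
    R_hat = W @ model.components_
    scores[k] = float(np.sqrt(np.mean((R[test_mask] - R_hat[test_mask]) ** 2)))

# 2.3) Select the best n_components
best_k = min(scores, key=scores.get)
print(scores, "best n_components:", best_k)

# 3) Refit on all real ratings and predict the full matrix
R_full = fill_with_item_means(R, known)
best = NMF(n_components=best_k, init="nndsvda", max_iter=2000, random_state=0)
R_pred = best.fit_transform(R_full) @ best.components_
print(np.round(R_pred, 2))
```

On real data you would repeat the split several times (or use proper cross-validation) before trusting the selected n_components.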

Perhaps good to see:

Upvotes: 2
