Reputation: 573
I'm trying to apply sklearn.decomposition.NMF to a matrix R that contains data on how users rated items, in order to predict user ratings for items they have not yet seen. The matrix's rows are users, its columns are items, and its values are scores, with a score of 0 meaning that the user has not rated that item yet.
With the code below I have only managed to get two matrices that, when multiplied together, give the original matrix back.
import numpy
from sklearn.decomposition import NMF

R = numpy.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
])

model = NMF(n_components=4)
A = model.fit_transform(R)  # user factors, shape (n_users, n_components)
B = model.components_       # item factors, shape (n_components, n_items)
n = numpy.dot(A, B)         # reconstructed rating matrix
print(n)
The problem is that the model does not predict new values in place of the 0s (which would be the predicted scores), but instead recreates the matrix as it was.
How do I get the model to predict user scores in place of my original matrix's zeros?
Upvotes: 3
Views: 4221
Reputation: 1274
pip install scikit-surprise
The docs and repo are here: https://github.com/NicolasHug/Surprise
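A minimal sketch of how it can be used for this kind of rating data (the built-in MovieLens 100k dataset here is just for illustration; surprise downloads it on first use):

from surprise import NMF, Dataset
from surprise.model_selection import cross_validate

# load the built-in MovieLens 100k ratings (prompts to download on first use)
data = Dataset.load_builtin('ml-100k')

# cross-validate surprise's NMF on the known ratings only
cross_validate(NMF(), data, measures=['RMSE', 'MAE'], cv=5, verbose=True)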
Upvotes: 0
Reputation: 23129
sklearn's implementation of NMF does not seem to support missing values (NaNs; here the 0 values basically represent unknown ratings corresponding to new users); refer to this issue. However, we can use surprise's NMF implementation, as shown in the following code:
import numpy as np
import pandas as pd
from surprise import NMF, Dataset, Reader
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)  # np.float was removed in recent NumPy; use the builtin float
R[R==0] = np.nan
print(R)
# [[ 5. 3. nan 1.]
# [ 4. nan nan 1.]
# [ 1. 1. nan 5.]
# [ 1. nan nan 4.]
# [nan 1. 5. 4.]]
# convert the wide matrix into long (user, item, rating) triples, dropping NaNs
df = pd.DataFrame(data=R, index=range(R.shape[0]), columns=range(R.shape[1]))
df = pd.melt(df.reset_index(), id_vars='index', var_name='items', value_name='ratings').dropna(axis=0)
reader = Reader(rating_scale=(0, 5))
data = Dataset.load_from_df(df[['index', 'items', 'ratings']], reader)
k = 2
algo = NMF(n_factors=k)
trainset = data.build_full_trainset()
algo.fit(trainset)
predictions = algo.test(trainset.build_testset())  # predict the known ratings
R_hat = np.zeros_like(R)
for uid, iid, true_r, est, _ in predictions:
    R_hat[uid, iid] = est
predictions = algo.test(trainset.build_anti_testset())  # predict the unknown ratings
for uid, iid, true_r, est, _ in predictions:
    R_hat[uid, iid] = est
print(R_hat)
# [[4.40762528 2.62138084 3.48176319 0.91649316]
# [3.52973408 2.10913555 2.95701406 0.89922637]
# [0.94977826 0.81254138 4.98449755 4.34497549]
# [0.89442186 0.73041578 4.09958967 3.50951819]
# [1.33811051 0.99007556 4.37795636 3.53113236]]
The NMF implementation is as per the [NMF:2014] paper, as described in the surprise docs and shown below:
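The multiplicative update rules, as reconstructed from the surprise documentation (p_{uf} and q_{if} are the user and item factors, r_{ui} the known ratings, \hat{r}_{ui} the current estimates, and \lambda_u, \lambda_i the regularization terms):

p_{uf} \leftarrow p_{uf} \cdot \frac{\sum_{i \in I_u} q_{if} \cdot r_{ui}}{\sum_{i \in I_u} q_{if} \cdot \hat{r}_{ui} + \lambda_u |I_u| p_{uf}}

q_{if} \leftarrow q_{if} \cdot \frac{\sum_{u \in U_i} p_{uf} \cdot r_{ui}}{\sum_{u \in U_i} p_{uf} \cdot \hat{r}_{ui} + \lambda_i |U_i| q_{if}}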
Note that the optimization here is performed using the known ratings only, so the predicted values for the known ratings are close to the true ratings (while the predicted values for the unknown ratings are not, in general, close to 0, as expected).
Again, as usual, we can find the number of factors k using cross-validation.
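For instance, continuing from the code above, a minimal sketch (the candidate values of k are an assumption for illustration):

from surprise.model_selection import cross_validate

# compare a few candidate factor counts by cross-validated RMSE
for k in (1, 2, 3):
    algo = NMF(n_factors=k)
    results = cross_validate(algo, data, measures=['RMSE'], cv=3, verbose=False)
    print(k, results['test_rmse'].mean())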
Upvotes: 1
Reputation: 2816
That is what is supposed to happen.
However, in most cases the number of components will not be nearly as large as the number of products and/or customers.
So, for instance, considering 2 components:
import numpy as np
from sklearn.decomposition import NMF

model = NMF(n_components=2)
A = model.fit_transform(R)
B = model.components_
R_estimated = np.dot(A, B)
print(np.sum(R - R_estimated))
-1.678873127048393
R_estimated
array([[5.2558264 , 1.99313836, 0. , 1.45512772],
[3.50429478, 1.32891458, 0. , 0.9701988 ],
[1.31294288, 0.94415991, 1.94956896, 3.94609389],
[0.98129195, 0.72179987, 1.52759811, 3.0788454 ],
[0. , 0.65008935, 2.84003662, 5.21894555]])
You can see that in this case many of the previous zeros are now other numbers you could use. For a bit of context, see https://en.wikipedia.org/wiki/Matrix_factorization_(recommender_systems).
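To keep the known ratings and take the model's values only where the original matrix had zeros, a minimal sketch (assuming R and R_estimated from the code above):

import numpy as np

mask = (R == 0)                            # positions with unknown ratings
R_filled = np.where(mask, R_estimated, R)  # model's values only at the zeros
print(R_filled)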
I think the question above is answered, but in case it helps, the complete procedure could be something like the sketch below.
For that we need to know which values in R are real and which ones we want to predict.
In many cases the 0s in R are those new cases/scenarios. It is common to replace the 0s in R with the averages for the products or customers, then calculate the decomposition and select the ideal n_components. For that selection, one or more criteria can be used to measure the gain on a test sample.
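A minimal sketch of that procedure (the per-item mean imputation and the held-out entries are assumptions for illustration):

import numpy as np
from sklearn.decomposition import NMF

R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)

known = R > 0                                # 0 marks an unknown rating
col_means = R.sum(axis=0) / known.sum(axis=0)
R_imputed = np.where(known, R, col_means)    # fill unknowns with item averages

# hold out a couple of known ratings as a test sample (illustrative choice)
test_idx = [(0, 0), (4, 3)]
R_train = R_imputed.copy()
for i, j in test_idx:
    R_train[i, j] = col_means[j]

# pick n_components by the squared error on the held-out ratings
for k in (1, 2, 3):
    model = NMF(n_components=k, max_iter=500)
    R_hat = model.fit_transform(R_train) @ model.components_
    err = np.mean([(R[i, j] - R_hat[i, j]) ** 2 for i, j in test_idx])
    print(k, err)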
Perhaps good to see:
Upvotes: 2