Reputation: 99
I am trying to build a content-based recommender system in Python with pandas/numpy/sklearn.
Here are the matrices involved and their sizes:
X: n_customers * n_features (contains the features of each customer)
Y: n_customers * n_products (contains the scores given by each customer to each product)
Theta: n_features * n_products
The aim is to learn Theta in order to predict the score a customer would give to every product (X*Theta). Y is a sparse matrix: each customer scores only a very small percentage of the products, which is why Y contains a lot of NaN values.
Here is my problem:
This is a regression problem with many targets (here target = product), but I want to run the regression only on the non-null values. Because the number of NaN values differs from one product to another, the set of usable rows is different for each target. How can I vectorize that?
Assume there are 1000 products and 100 000 customers, each one having 20 features.
For each product I need to run the regression on the non-null values only. So without vectorization, I would need 1000 different regressors, each one learning a Theta vector of length 20.
If possible, I would like to solve this problem with sklearn. Ridge regression, for example, supports multiple targets (Y as a matrix).
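To make the setup concrete, here is a minimal NumPy sketch of the per-product loop described above, next to one possible vectorized form. It assumes a ridge penalty with no intercept, so that each product j solves (X_j^T X_j + alpha*I) theta_j = X_j^T y_j, where X_j keeps only the customers who scored product j. The function names `fit_theta_loop` and `fit_theta_vectorized` and the default `alpha=1.0` are illustrative, not part of sklearn.

```python
import numpy as np


def fit_theta_loop(X, Y, alpha=1.0):
    """Baseline: one ridge solve per product, using only the observed rows."""
    d, p = X.shape[1], Y.shape[1]
    Theta = np.zeros((d, p))
    for j in range(p):
        observed = ~np.isnan(Y[:, j])      # customers who scored product j
        Xj, yj = X[observed], Y[observed, j]
        A = Xj.T @ Xj + alpha * np.eye(d)  # regularized Gram matrix
        Theta[:, j] = np.linalg.solve(A, Xj.T @ yj)
    return Theta


def fit_theta_vectorized(X, Y, alpha=1.0):
    """Same solution, with all per-product Gram matrices built at once."""
    n, d = X.shape
    mask = ~np.isnan(Y)                    # (n, p), True where a score exists
    Y0 = np.where(mask, Y, 0.0)            # missing scores contribute nothing
    # Outer product x_i x_i^T for every customer, flattened to (n, d*d).
    XX = (X[:, :, None] * X[:, None, :]).reshape(n, d * d)
    # Sum the outer products over the observed customers of each product.
    A = (mask.T.astype(float) @ XX).reshape(-1, d, d)  # (p, d, d)
    A = A + alpha * np.eye(d)              # broadcast the penalty to each product
    B = (X.T @ Y0).T                       # (p, d), masked X^T y per product
    Theta = np.linalg.solve(A, B[:, :, None])[:, :, 0]  # batched solve, (p, d)
    return Theta.T                         # (d, p), same layout as Theta above
```

For the sizes in the question (100 000 customers, 20 features), the intermediate `XX` array holds 100 000 x 400 floats, roughly 320 MB, so it may be worth accumulating `A` over chunks of customers instead. With `alpha > 0`, a product with no scores at all simply gets a zero column.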
I hope it's clear enough.
Thank you for your help.
Upvotes: 3
Views: 3777
Reputation: 3092
I believe you can make this work with centered cosine similarity (Pearson correlation), using a collaborative filtering technique.
Before you compute the Pearson correlation, you need to fill the nulls (the fields that don't have any entries) with zero. Pearson correlation centers the scores around zero, so a filled-in zero acts as a neutral score rather than a low one, which tends to give better recommendations.
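A minimal sketch of that idea, assuming customer-to-customer similarity is what is wanted. Here each customer's mean is computed over their observed scores only, and missing entries are set to zero after centering; that ordering is an assumption about the intended procedure, and the function name `centered_cosine_similarity` is illustrative.

```python
import numpy as np


def centered_cosine_similarity(Y):
    """Customer-customer similarity from a score matrix with NaN for missing."""
    mask = ~np.isnan(Y)
    counts = mask.sum(axis=1, keepdims=True)
    sums = np.where(mask, Y, 0.0).sum(axis=1, keepdims=True)
    # Mean of each customer's observed scores (0 if they scored nothing).
    means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    # Center the observed scores; missing scores become a neutral 0.
    centered = np.where(mask, Y - means, 0.0)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    unit = np.divide(centered, norms, out=np.zeros_like(centered), where=norms > 0)
    return unit @ unit.T                   # (n_customers, n_customers)
```

Scores for a customer could then be predicted as a similarity-weighted average of the scores given by their most similar customers.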
Upvotes: 1