matrix factorization model returning much smaller dataframe after predicting ratings in pyspark

Question

I'm trying to create a product recommender with the code below. I'm using matrix factorization from spark ml. I have data that has a customer_id, product_id, and a numeric rating value that has been normalized. So all rating values are between 0 and 1. My dataset has almost 10M records with no null product_id or customer_id. Yet after I train the model and then try to make predictions for the entire dataset, I'm getting back a very small number of predictions. I'm getting similar size test_pred dataframe returned by the model whether I set the coldStartStrategy to "nan" or "drop". Does anyone see what the issue may be and can you suggest how to fix?

code:

print(data.count())

print(data[['customer_id']].distinct().count())

print(data[['product_id']].distinct().count())

output:

9943626
1715292
308792

code:

from pyspark.ml.recommendation import ALS

(training, test) = data.randomSplit([0.7, 0.3])

als = ALS(implicitPrefs=True,
          maxIter=5, 
          regParam=0.01, 
          userCol="customer_id_index", 
          itemCol="product_id_index", 
          ratingCol="rating",
         coldStartStrategy="nan")
model = als.fit(training)



test_pred=model.transform(data)

print(test_pred.count())

print(test_pred[['customer_id']].distinct().count())

print(test_pred[['product_id']].distinct().count())

output:

3346
522
760

matrix factorization model returning much smaller dataframe after predicting ratings in pyspark

Answers (1)

Related Questions