Reputation: 171
I observed that the input data to ALS need not have a unique rating per user-item combination.
Here is a reproducible example.
# Sample dataframe
df = spark.createDataFrame([(0, 0, 4.0), (0, 1, 2.0),
                            (1, 1, 3.0), (1, 2, 4.0),
                            (2, 1, 1.0), (2, 2, 5.0)],
                           ["user", "item", "rating"])
df.show(50, 0)
+----+----+------+
|user|item|rating|
+----+----+------+
|0 |0 |4.0 |
|0 |1 |2.0 |
|1 |1 |3.0 |
|1 |2 |4.0 |
|2 |1 |1.0 |
|2 |2 |5.0 |
+----+----+------+
As you can see, each user-item combination has exactly one rating (the ideal scenario). If we pass this dataframe to ALS, it gives predictions like the ones below:
# Fitting ALS
from pyspark.ml.recommendation import ALS
from pyspark.sql.functions import broadcast

als = ALS(rank=5,
          maxIter=5,
          seed=0,
          regParam=0.1,
          userCol='user',
          itemCol='item',
          ratingCol='rating',
          nonnegative=True)
model = als.fit(df)

# Predictions from ALS for every user-item combination
all_comb = df.select('user').distinct().join(broadcast(df.select('item').distinct()))
predictions = model.transform(all_comb)
predictions.show(20,0)
+----+----+----------+
|user|item|prediction|
+----+----+----------+
|0 |0 |3.9169915 |
|0 |1 |2.031506 |
|0 |2 |2.3546133 |
|1 |0 |4.9588947 |
|1 |1 |2.8347554 |
|1 |2 |4.003007 |
|2 |0 |0.9958025 |
|2 |1 |1.0896711 |
|2 |2 |4.895194 |
+----+----+----------+
Everything so far makes sense to me. But what if we have a dataframe that contains multiple ratings per user-item combination, like the one below?
# Sample dataframe with duplicate user-item ratings
df = spark.createDataFrame([(0, 0, 4.0), (0, 0, 3.5),
                            (0, 0, 4.1), (0, 1, 2.0),
                            (0, 1, 1.9), (0, 1, 2.1),
                            (1, 1, 3.0), (1, 1, 2.8),
                            (1, 2, 4.0), (1, 2, 3.6),
                            (2, 1, 1.0), (2, 1, 0.9),
                            (2, 2, 5.0), (2, 2, 4.9)],
                           ["user", "item", "rating"])
df.show(100, 0)
+----+----+------+
|user|item|rating|
+----+----+------+
|0 |0 |4.0 |
|0 |0 |3.5 |
|0 |0 |4.1 |
|0 |1 |2.0 |
|0 |1 |1.9 |
|0 |1 |2.1 |
|1 |1 |3.0 |
|1 |1 |2.8 |
|1 |2 |4.0 |
|1 |2 |3.6 |
|2 |1 |1.0 |
|2 |1 |0.9 |
|2 |2 |5.0 |
|2 |2 |4.9 |
+----+----+------+
As you can see in the dataframe above, there are multiple records for some user-item combinations. For example, user 0 has rated item 0 three times: 4.0, 3.5 and 4.1.
What happens if I pass this input dataframe to ALS? Will it work? I initially thought it should not, since ALS is supposed to receive a unique rating per user-item combination, but when I ran it, it worked, which surprised me!
# Fitting ALS
from pyspark.sql.functions import broadcast

als = ALS(rank=5,
          maxIter=5,
          seed=0,
          regParam=0.1,
          userCol='user',
          itemCol='item',
          ratingCol='rating',
          nonnegative=True)
model = als.fit(df)

# Predictions from ALS for every user-item combination
all_comb = df.select('user').distinct().join(broadcast(df.select('item').distinct()))
predictions = model.transform(all_comb)
predictions.show(20,0)
+----+----+----------+
|user|item|prediction|
+----+----+----------+
|0 |0 |3.7877638 |
|0 |1 |2.020348 |
|0 |2 |2.4364853 |
|1 |0 |4.9624424 |
|1 |1 |2.7311888 |
|1 |2 |3.8018093 |
|2 |0 |1.2490809 |
|2 |1 |1.0351425 |
|2 |2 |4.8451777 |
+----+----+----------+
Why did it work? I thought it would fail, but it did not, and it even produces predictions.
I looked at research papers, what ALS source code I could follow, and the information available on the internet, but could not find anything conclusive. Is it taking the average of these different ratings before fitting, or doing something else?
Has anyone encountered this before? Any idea how ALS handles this kind of data internally?
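One quick sanity check I can do (plain Python, no Spark needed; this is just my own back-of-the-envelope comparison, not something from the ALS docs): if ALS were effectively averaging the duplicates, the fitted predictions should land near the per-pair means of the sample data.

```python
from collections import defaultdict

# The duplicated sample ratings from the dataframe above
ratings = [(0, 0, 4.0), (0, 0, 3.5), (0, 0, 4.1),
           (0, 1, 2.0), (0, 1, 1.9), (0, 1, 2.1),
           (1, 1, 3.0), (1, 1, 2.8),
           (1, 2, 4.0), (1, 2, 3.6),
           (2, 1, 1.0), (2, 1, 0.9),
           (2, 2, 5.0), (2, 2, 4.9)]

# Accumulate sum and count per (user, item) pair
totals = defaultdict(lambda: [0.0, 0])
for user, item, r in ratings:
    totals[(user, item)][0] += r
    totals[(user, item)][1] += 1

means = {pair: s / n for pair, (s, n) in totals.items()}
print(round(means[(0, 0)], 4))   # 3.8667 -- the ALS prediction above was 3.7877638
```

The predictions sit near (but not exactly on) the per-pair means, which at least does not rule out each duplicate being treated as a separate observation in a regularized fit, though that is only an inference from the numbers.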
Upvotes: 1
Views: 442
Reputation: 1
The parallelized matrix factorization implemented in Spark actually groups ratings by (user, item) partition pair and adds the different ratings a user gave the same item. You can verify this yourself in the Scala code on Spark's GitHub, around line 1377:
* Implementation note: This implementation produces the same result as the following but
* generates fewer intermediate objects:
*
* {{{
*     ratings.map { r =>
*       ((srcPart.getPartition(r.user), dstPart.getPartition(r.item)), r)
*     }.aggregateByKey(new RatingBlockBuilder)(
*         seqOp = (b, r) => b.add(r),
*         combOp = (b0, b1) => b0.merge(b1.build()))
*       .mapValues(_.build())
* }}}
where seqOp determines how to add two rating objects: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
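As a rough illustration (a pure-Python sketch of the seqOp/combOp pattern, not the actual Spark implementation; `blocks` here is a hypothetical stand-in for RatingBlockBuilder), the point is that duplicate (user, item) ratings are simply accumulated into the same block rather than rejected or deduplicated:

```python
from collections import defaultdict

ratings = [(0, 0, 4.0), (0, 0, 3.5), (0, 0, 4.1),  # duplicates for (0, 0)
           (0, 1, 2.0)]

blocks = defaultdict(list)
for user, item, rating in ratings:
    # seqOp analogue: add the rating into its block; nothing checks
    # whether this (user, item) pair has already been seen
    blocks[(user, item)].append(rating)

# All three duplicate ratings survive the grouping, so each one
# contributes a separate term to the least-squares fit for that pair.
print(blocks[(0, 0)])   # [4.0, 3.5, 4.1]
```

So no error is raised: ALS never requires the (user, item) key to be unique, it just aggregates whatever ratings land in each block.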
Upvotes: 0