Reputation: 171
I observed that the input data to ALS need not have a unique rating per user-item combination.
Here is a reproducible example.
# Sample dataframe
df = spark.createDataFrame([(0, 0, 4.0), (0, 1, 2.0),
                            (1, 1, 3.0), (1, 2, 4.0),
                            (2, 1, 1.0), (2, 2, 5.0)],
                           ["user", "item", "rating"])
df.show(50, 0)
+----+----+------+
|user|item|rating|
+----+----+------+
|0 |0 |4.0 |
|0 |1 |2.0 |
|1 |1 |3.0 |
|1 |2 |4.0 |
|2 |1 |1.0 |
|2 |2 |5.0 |
+----+----+------+
As you can see, each user-item combination has exactly one rating (the ideal scenario). If we pass this dataframe to ALS, it gives predictions like the ones below:
# Fitting ALS
from pyspark.ml.recommendation import ALS
from pyspark.sql.functions import broadcast

als = ALS(rank=5,
          maxIter=5,
          seed=0,
          regParam=0.1,
          userCol='user',
          itemCol='item',
          ratingCol='rating',
          nonnegative=True)
model = als.fit(df)

# Predictions from ALS for every user-item combination
all_comb = df.select('user').distinct().join(broadcast(df.select('item').distinct()))
predictions = model.transform(all_comb)
predictions.show(20,0)
+----+----+----------+
|user|item|prediction|
+----+----+----------+
|0 |0 |3.9169915 |
|0 |1 |2.031506 |
|0 |2 |2.3546133 |
|1 |0 |4.9588947 |
|1 |1 |2.8347554 |
|1 |2 |4.003007 |
|2 |0 |0.9958025 |
|2 |1 |1.0896711 |
|2 |2 |4.895194 |
+----+----+----------+
Everything so far makes sense to me. But what if we have a dataframe that contains multiple ratings per user-item combination, like the one below?
# Sample dataframe with duplicate user-item ratings
df = spark.createDataFrame([(0, 0, 4.0), (0, 0, 3.5),
                            (0, 0, 4.1), (0, 1, 2.0),
                            (0, 1, 1.9), (0, 1, 2.1),
                            (1, 1, 3.0), (1, 1, 2.8),
                            (1, 2, 4.0), (1, 2, 3.6),
                            (2, 1, 1.0), (2, 1, 0.9),
                            (2, 2, 5.0), (2, 2, 4.9)],
                           ["user", "item", "rating"])
df.show(100, 0)
+----+----+------+
|user|item|rating|
+----+----+------+
|0 |0 |4.0 |
|0 |0 |3.5 |
|0 |0 |4.1 |
|0 |1 |2.0 |
|0 |1 |1.9 |
|0 |1 |2.1 |
|1 |1 |3.0 |
|1 |1 |2.8 |
|1 |2 |4.0 |
|1 |2 |3.6 |
|2 |1 |1.0 |
|2 |1 |0.9 |
|2 |2 |5.0 |
|2 |2 |4.9 |
+----+----+------+
As you can see in the dataframe above, there are multiple records for some user-item combinations. For example, user 0 has rated item 0 three times: 4.0, 3.5 and 4.1.
What happens if I pass this input dataframe to ALS? Will it work? I initially thought it should not, since ALS is supposed to receive a unique rating per user-item combination, but when I ran it, it worked, which surprised me!
# Fitting ALS
from pyspark.sql.functions import broadcast

als = ALS(rank=5,
          maxIter=5,
          seed=0,
          regParam=0.1,
          userCol='user',
          itemCol='item',
          ratingCol='rating',
          nonnegative=True)
model = als.fit(df)

# Predictions from ALS for every user-item combination
all_comb = df.select('user').distinct().join(broadcast(df.select('item').distinct()))
predictions = model.transform(all_comb)
predictions.show(20,0)
+----+----+----------+
|user|item|prediction|
+----+----+----------+
|0 |0 |3.7877638 |
|0 |1 |2.020348 |
|0 |2 |2.4364853 |
|1 |0 |4.9624424 |
|1 |1 |2.7311888 |
|1 |2 |3.8018093 |
|2 |0 |1.2490809 |
|2 |1 |1.0351425 |
|2 |2 |4.8451777 |
+----+----+----------+
Why did it work? I thought it would fail, but it did not, and it even produces predictions.
I looked at research papers, what ALS source code I could follow, and the information available on the internet, but could not find anything conclusive. Is it taking the average of these different ratings before fitting, or doing something else?
Has anyone encountered this before? Any idea how ALS handles this kind of data internally?
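One quick sanity check I can do (plain Python, no Spark needed; this is just my own back-of-the-envelope comparison, not something from the ALS docs): if ALS were effectively averaging the duplicates, the fitted predictions should land near the per-pair means of the sample data.

```python
from collections import defaultdict

# The duplicated sample ratings from the dataframe above
ratings = [(0, 0, 4.0), (0, 0, 3.5), (0, 0, 4.1),
           (0, 1, 2.0), (0, 1, 1.9), (0, 1, 2.1),
           (1, 1, 3.0), (1, 1, 2.8),
           (1, 2, 4.0), (1, 2, 3.6),
           (2, 1, 1.0), (2, 1, 0.9),
           (2, 2, 5.0), (2, 2, 4.9)]

# Accumulate sum and count per (user, item) pair
totals = defaultdict(lambda: [0.0, 0])
for user, item, r in ratings:
    totals[(user, item)][0] += r
    totals[(user, item)][1] += 1

means = {pair: s / n for pair, (s, n) in totals.items()}
print(round(means[(0, 0)], 4))   # 3.8667 -- the ALS prediction above was 3.7877638
```

The predictions sit near (but not exactly on) the per-pair means, which at least does not rule out each duplicate being treated as a separate observation in a regularized fit, though that is only an inference from the numbers.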
Upvotes: 1
Views: 442
Reputation: 1
The parallelized matrix factorization implemented in Spark actually groups ratings by (user, item) partition pair and adds the different ratings a user gave the same item. You can verify this yourself in the Scala code on Spark's GitHub, around line 1377:
* Implementation note: This implementation produces the same result as the following but
* generates fewer intermediate objects:
*
* {{{
*     ratings.map { r =>
*       ((srcPart.getPartition(r.user), dstPart.getPartition(r.item)), r)
*     }.aggregateByKey(new RatingBlockBuilder)(
*         seqOp = (b, r) => b.add(r),
*         combOp = (b0, b1) => b0.merge(b1.build()))
*       .mapValues(_.build())
* }}}
where seqOp determines how to add two rating objects: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
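As a rough illustration (a pure-Python sketch of the seqOp/combOp pattern, not the actual Spark implementation; `blocks` here is a hypothetical stand-in for RatingBlockBuilder), the point is that duplicate (user, item) ratings are simply accumulated into the same block rather than rejected or deduplicated:

```python
from collections import defaultdict

ratings = [(0, 0, 4.0), (0, 0, 3.5), (0, 0, 4.1),  # duplicates for (0, 0)
           (0, 1, 2.0)]

blocks = defaultdict(list)
for user, item, rating in ratings:
    # seqOp analogue: add the rating into its block; nothing checks
    # whether this (user, item) pair has already been seen
    blocks[(user, item)].append(rating)

# All three duplicate ratings survive the grouping, so each one
# contributes a separate term to the least-squares fit for that pair.
print(blocks[(0, 0)])   # [4.0, 3.5, 4.1]
```

So no error is raised: ALS never requires the (user, item) key to be unique, it just aggregates whatever ratings land in each block.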
Upvotes: 0