Ivan Shelonik

Reputation: 2028

Spark ALS gives the same output

I want to build a small ensemble of PySpark ALS recommender models. The factor matrices in ALS are initialized randomly, so different runs should give slightly different results, and averaging them should give more accurate recommendations. So I train the model twice. This gives me two different ALS model objects, but recommendForAllUsers() returns the same recommendations for both models. What is wrong here, and why do I have to restart the script to get different outputs even though the fitted ALS models are different objects?

P.S. I do not set the seed parameter anywhere.

def __train_model(ratings):
    """Train the ALS model with the current dataset
    """
    logger.info("Training the ALS model...")

    als = ALS(rank=rank, maxIter=iterations, implicitPrefs=True, regParam=regularization_parameter,
              userCol="order_id", itemCol="product_id", ratingCol="count")

    model = als.fit(ratings)

    logger.info("ALS model built!")

    return model


model1 = __train_model(ratings_DF)
print(model1)
sim_table_1 = model1.recommendForAllUsers(100).toPandas()

model2 = __train_model(ratings_DF)
print(model2)
sim_table_2 = model2.recommendForAllUsers(100).toPandas()

print('Equality of objects:', model1 == model2)

Output:

INFO:__main__:Training the ALS model...
INFO:__main__:ALS model built!
ALS_444a9e62eb6938248b4c
INFO:__main__:Training the ALS model...
INFO:__main__:ALS model built!
ALS_465c95728272696c6c67
Equality of objects: False

Upvotes: 1

Views: 888

Answers (1)

kshell

Reputation: 236

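If you don't provide a value for the seed parameter when instantiating ALS, it defaults to a hash of the class name (the string "ALS"). Within one Python process that hash is constant, so every ALS instance you create gets the same seed and therefore the same random initialization. That's why your recommendations are always the same. In Python 3, string hashes are normally salted per interpreter process, which would explain why the results change only after you restart the script.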

The PySpark code that sets the default seed:

self._setDefault(seed=hash(type(self).__name__))

Example:

from pyspark.ml.recommendation import ALS
als1 = ALS(rank=10, maxIter=5)
als2 = ALS(rank=10, maxIter=5)
als1.getSeed() == als2.getSeed() == hash("ALS")  # True
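As for why a restart changes the results: a likely explanation is that Python 3 salts string hashes per interpreter process unless PYTHONHASHSEED is fixed, so hash("ALS") is stable within one run but can differ between runs. The sketch below illustrates this in plain Python, with no Spark involved. class_name_hash is a hypothetical helper written for this illustration, and whether your driver process has PYTHONHASHSEED fixed depends on how it is launched.

```python
import os
import subprocess
import sys


def class_name_hash(hash_seed):
    """Return hash('ALS') as computed by a fresh interpreter.

    hash_seed is passed as PYTHONHASHSEED, which controls the salt
    Python uses for string hashes in that process.
    """
    env = dict(os.environ, PYTHONHASHSEED=str(hash_seed))
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('ALS'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


if __name__ == "__main__":
    # Same process: the hash never changes, so the default seed never changes.
    print(hash("ALS") == hash("ALS"))
    # Different processes with different hash salts: the values differ.
    print(class_name_hash(1), class_name_hash(2))
```

Two ALS instances created in the same script therefore share one default seed, while two separate runs of the script may not.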

If you want a different model every time, generate a random integer (for example with numpy.random.randint) and pass it as the seed argument when you construct ALS.
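A minimal sketch of that idea for the ensemble case, using the standard library instead of numpy. fresh_seeds and train_ensemble are hypothetical helper names, and the ALS parameters are whatever you already pass in your own training function. The Spark import is done lazily so the seed helper can be tried without a Spark session.

```python
import random


def fresh_seeds(n, rng=None):
    """Return n distinct non-negative integer seeds."""
    rng = rng or random.SystemRandom()
    # Sampling without replacement guarantees the seeds are all different.
    return rng.sample(range(2**31 - 1), n)


def train_ensemble(ratings, n_models=2, **als_params):
    """Fit n_models ALS models on the same data, each with its own seed."""
    # Imported here so fresh_seeds() can be used without PySpark installed.
    from pyspark.ml.recommendation import ALS

    models = []
    for seed in fresh_seeds(n_models):
        # An explicit seed overrides the class-name-based default.
        als = ALS(seed=seed, **als_params)
        models.append(als.fit(ratings))
    return models
```

With the setup from the question, a call could look like train_ensemble(ratings_DF, n_models=2, rank=rank, maxIter=iterations, implicitPrefs=True, regParam=regularization_parameter, userCol="order_id", itemCol="product_id", ratingCol="count"). Logging the seeds alongside the models makes individual runs reproducible later.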

Upvotes: 4
