Reputation: 31
I am trying to build a recommendation system by using Spark ML ALS where data are as follows
"User-ID";"ISBN";"Book-Rating"
276725;034545104;0
276726;0155061224;5
276727;0446520802;0
276729;052165615;3
276729;0521795028;6
I am using Spark 2.1.0 and MongoDB to load the data. Here is the piece of code that defines the DataFrame and its schema after casting.
/*
 * Load the rating data
 */
val dfrating = spark.loadFromMongoDB(readConfig)
val bookRatings = dfrating.selectExpr("cast(User_ID as Long) User_ID", "cast(ISBN as Long) ISBN", "Book_Rating")
bookRatings.printSchema()
val als = new ALS().setMaxIter(10).setRegParam(0.01).setUserCol("User_ID").setItemCol("ISBN").setRatingCol("Book_Rating")
val model = als.fit(training)
When I run this, I get:
root
|-- User_ID: long (nullable = true)
|-- ISBN: long (nullable = true)
|-- Book_Rating: integer (nullable = true)
+-------+----------+-----------+
|User_ID| ISBN|Book_Rating|
+-------+----------+-----------+
| 215| 61030147| 6|
| 5750|1853260045| 0|
| 11676| 743244249| 0|
| 11676|1551665700| 0|
Caused by: java.lang.IllegalArgumentException: ALS only supports values in Integer range for columns User_ID and ISBN. Value 8.477024456E9 was out of Integer range.
at org.apache.spark.ml.recommendation.ALSModelParams$$anonfun$1.apply$mcID$sp(ALS.scala:87)
Is there a solution to get this running? I have found these suggestions for the same problem (How to use mllib.recommendation if the user ids are string instead of contiguous integers?, How to use long user ID in PySpark ALS, and Non-integer ids in Spark MLlib ALS), but I don't know where to begin.
Here is what I tried:
val isbn_als = new StringIndexer()
.setHandleInvalid("skip")
.setInputCol("ISBN")
.setOutputCol("ISBN_als")
.fit(uRatings)
val isbn_als_reverse = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
val als = new ALS().setMaxIter(10).setRegParam(0.01).setUserCol("User_ID").setItemCol("ISBN_als").setRatingCol("Book_Rating")
/*
 * Define the order of the operations to perform
 */
println("Setting up the pipeline")
val alsPipeline = new Pipeline().setStages(Array(isbn_als, als, isbn_als_reverse))
/*
 * Build the recommendation model from the training data
 */
println("Building the model")
val alsModel = alsPipeline.fit(training)
/*
 * Run the model on the test data, then display a sample of predictions
 */
println("Running the model on the test data")
val alsPredictions = alsModel.transform(test).na.drop()
println("Displaying the predictions")
alsPredictions.select($"User_ID",$"ISBN", $"Book_Rating", $"prediction").show(20)
But I get this exception when I use IndexToString() in the pipeline:
Setting up the pipeline
Building the model
Running the model on the test data
Exception in thread "main" java.lang.ClassCastException: org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to org.apache.spark.ml.attribute.NominalAttribute
at org.apache.spark.ml.feature.IndexToString.transform(StringIndexer.scala:313)
at org.apache.spark.ml.PipelineModel$$anonfun$transform$1.apply(Pipeline.scala:305)
at org.apache.spark.ml.PipelineModel$$anonfun$transform$1.apply(Pipeline.scala:305)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
When I do not use IndexToString(), I get negative predictions:
+-------+---------+-----------+-------------+
|User_ID| ISBN|Book_Rating| prediction|
+-------+---------+-----------+-------------+
| 140340|786881852| 10| 6.9798374|
| 127327|786881852| 0|-1.2718141E-4|
| 103336|786881852| 0| 1.2374072|
| 138578|786881852| 9| 8.200257|
| 172742|786881852| 0| -1.3278971|
| 31909|786881852| 6| 5.997123|
| 69554|786881852| 5| 2.819587|
| 173650|786881852| 0| 0.42850634|
I suppose the negative predictions are due to IndexToString() not being used. If so, how do I use IndexToString() in the pipeline?
Upvotes: 2
Views: 2472
Reputation: 9328
The exception you get is thrown by the IndexToString stage, which is misconfigured: you make it decode the prediction back to a String, but the prediction is not a product (an ISBN), it is a rating. ALS predicts ratings, not products.
Which in turn means you do not need an inverse transformation at all.
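As an aside, the negative predictions are not caused by a missing IndexToString either: ALS solves a regression problem, so predicted ratings can fall below zero. If you want to rule that out, ALS can be constrained to non-negative factors. A minimal sketch, using the column names from your question:

```scala
import org.apache.spark.ml.recommendation.ALS

// Same configuration as in the question, plus a non-negativity
// constraint on the learned factors, so predicted ratings stay >= 0
val als = new ALS()
  .setMaxIter(10)
  .setRegParam(0.01)
  .setUserCol("User_ID")
  .setItemCol("ISBN_als")
  .setRatingCol("Book_Rating")
  .setNonnegative(true) // solve with non-negative least squares
```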
See the following working sample:
scala> import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.Pipeline
scala> import org.apache.spark.ml.recommendation._
import org.apache.spark.ml.recommendation._
scala> import org.apache.spark.ml.feature._
import org.apache.spark.ml.feature._
// This is just a helper
scala> case class Rating(user: Long, isbn: String, rating: Double)
defined class Rating
// Let's create 2 books, 3 users, 3 ratings to train the model
scala> val rawRatings = Seq(Rating(1, "1234567890123", 1), Rating(2, "12345678901234", 2), Rating(3, "12345678901234", 3))
rawRatings: Seq[Rating] = List(Rating(1,1234567890123,1.0), Rating(2,12345678901234,2.0), Rating(3,12345678901234,3.0))
scala> val ratings = spark.createDataFrame(rawRatings)
scala> val isbn_als = new StringIndexer().setInputCol("isbn").setOutputCol("isbnIDX")
isbn_als: org.apache.spark.ml.feature.StringIndexer = strIdx_53d752f20587
scala> val als = new ALS().setUserCol("user").setItemCol("isbnIDX").setRatingCol("rating")
als: org.apache.spark.ml.recommendation.ALS = als_41eff9ae835d
scala> val stages = Array(isbn_als, als)
stages: Array[org.apache.spark.ml.PipelineStage with org.apache.spark.ml.util.DefaultParamsWritable{def copy(extra: org.apache.spark.ml.param.ParamMap): org.apache.spark.ml.PipelineStage with org.apache.spark.ml.util.DefaultParamsWritable}] = Array(strIdx_53d752f20587, als_41eff9ae835d)
// Do the actual training
scala> val pipeline = new Pipeline().setStages(stages)
pipeline: org.apache.spark.ml.Pipeline = pipeline_5f05891139b6
scala> val pipeModel = pipeline.fit(ratings)
pipeModel: org.apache.spark.ml.PipelineModel = pipeline_5f05891139b6
// And make predictions for any user/book combination
scala> case class UserBook(user: Long, isbn: String)
defined class UserBook
scala> val testSet = Seq(UserBook(1, "12345678901234"))
testSet: Seq[UserBook] = List(UserBook(1,12345678901234))
scala> val testDF = spark.createDataFrame(testSet)
testDF: org.apache.spark.sql.DataFrame = [user: bigint, isbn: string]
scala> pipeModel.transform(testDF).show
+----+--------------+-------+----------+
|user| isbn|isbnIDX|prediction|
+----+--------------+-------+----------+
| 1|12345678901234| 0.0| 0.7389956|
+----+--------------+-------+----------+
Here, "prediction" is the predicted rating for book ISBN 12345678901234 for user 1. The isbnIDX column is used for computational purposes only and need not be reversed, because we already have the isbn in the dataframe.
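If you ever do need to recover ISBNs from indices (for example after generating top-N recommendations, where only isbnIDX is present), you can decode them explicitly using the labels learned by the fitted StringIndexerModel rather than relying on column metadata. A sketch against the pipeline above (the output column name is my own choice):

```scala
import org.apache.spark.ml.feature.{IndexToString, StringIndexerModel}

// The fitted StringIndexer is the first stage of the pipeline model
val indexerModel = pipeModel.stages(0).asInstanceOf[StringIndexerModel]

// Decode isbnIDX back to the original ISBN string using the learned labels
val decoder = new IndexToString()
  .setInputCol("isbnIDX")
  .setOutputCol("isbnDecoded")
  .setLabels(indexerModel.labels)

decoder.transform(pipeModel.transform(testDF)).show()
```

Passing the labels explicitly with setLabels avoids the NominalAttribute cast error you saw, which occurs when IndexToString has to look for metadata on a column that does not carry any.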
Upvotes: 4