Reputation: 686
I have a data set with columns userId (String), itemId (int), and rating (int).
+--------+--------+--------+
| userId | itemId | rating |
+--------+--------+--------+
| abc13  |     23 |      1 |
| qwe34  |     56 |      3 |
| qwe34  |     35 |      4 |
+--------+--------+--------+
I want to map the string userIds to unique long values. I tried mapping the userIds using zipWithUniqueId(), which gives a pair RDD:
+--------+--------------+
| userId | userIdMapped |
+--------+--------------+
| abc13  |            0 |
| qwe34  |            1 |
+--------+--------------+
I want to add the long values as a new column and create the dataset as below:
+--------+--------+--------+--------------+
| userId | itemId | rating | userIdMapped |
+--------+--------+--------+--------------+
| abc13  |     23 |      1 |            0 |
| qwe34  |     56 |      3 |            1 |
| qwe34  |     35 |      4 |            1 |
+--------+--------+--------+--------------+
I tried the following:
JavaRDD<Feedback> feedbackRDD = spark.read().jdbc(MYSQL_CONNECTION_URL, feedbackQuery, connectionProperties)
        .javaRDD().map(Feedback.mapFunc);
JavaPairRDD<String, Long> mappedPairRDD = feedbackRDD.map(new Function<Feedback, String>() {
    public String call(Feedback p) throws Exception {
        return p.getUserId();
    }
}).distinct().zipWithUniqueId();
Dataset<Row> feedbackDS = spark.createDataFrame(feedbackRDD, Feedback.class);
Dataset<String> stringIds = spark.createDataset(mappedPairRDD.keys().collect(), Encoders.STRING());
Dataset<Long> valueIds = spark.createDataset(mappedPairRDD.values().collect(), Encoders.LONG());
Dataset<Row> longIds = valueIds.withColumnRenamed("value", "userIdMapped");
Dataset<Row> userIdMap = longIds.join(stringIds); // no join condition, so this is a cross join
Dataset<Row> feedbackDSUserMapped = feedbackDS.join(userIdMap,
        feedbackDS.col("userId").equalTo(userIdMap.col("value")), "inner");
// Here the 'value' column contains the string user ids
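For reference, the pairing between keys and values is lost once they are collected into two separate single-column datasets. A sketch of one way to keep each key with its value (assuming the mappedPairRDD above; scala.Tuple2, RowFactory, and the org.apache.spark.sql.types helpers would need to be imported, and the variable names here are illustrative):
JavaRDD<Row> mappingRows = mappedPairRDD.map(new Function<Tuple2<String, Long>, Row>() {
    public Row call(Tuple2<String, Long> t) throws Exception {
        // keep the key and its value together in a single Row
        return RowFactory.create(t._1(), t._2());
    }
});
StructType mappingSchema = DataTypes.createStructType(new StructField[]{
        DataTypes.createStructField("userId", DataTypes.StringType, false),
        DataTypes.createStructField("userIdMapped", DataTypes.LongType, false)
});
Dataset<Row> userIdMapDF = spark.createDataFrame(mappingRows, mappingSchema);
// joining on a real key avoids the Cartesian product described below
Dataset<Row> mappedFeedback = feedbackDS.join(userIdMapDF, "userId");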
The userIdMap dataset is joined incorrectly, as below; the join has no condition, so it produces a Cartesian product:
+--------------+-------+
| userIdMapped | value |
+--------------+-------+
|            0 | abc13 |
|            0 | qwe34 |
|            1 | abc13 |
|            1 | qwe34 |
+--------------+-------+
Therefore the resulting feedbackDSUserMapped is wrong.
I'm new to Spark and I'm sure there must be a better way of doing this.
What is the best way to get the long values from the pair RDD and attach them to the matching userId rows in the initial dataset (RDD)?
Any help would be much appreciated.
The data is to be used for ALS model.
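For context, spark.ml's ALS expects numeric user and item id columns (the ids must fit within the integer range), which is why the string ids need to be mapped first. A minimal sketch of the intended downstream use, where everything beyond the column names above is an assumption (imports from org.apache.spark.ml.recommendation):
// Sketch only: assumes a Dataset<Row> 'feedback' holding the mapped columns shown above.
ALS als = new ALS()
        .setUserCol("userIdMapped")
        .setItemCol("itemId")
        .setRatingCol("rating");
ALSModel model = als.fit(feedback);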
Upvotes: 1
Views: 531
Reputation: 686
Solved it using StringIndexer:
StringIndexer indexer = new StringIndexer()
        .setInputCol("userId")
        .setOutputCol("userIdMapped");
Dataset<Row> userJoinedDataSet = indexer.fit(feedbackDS).transform(feedbackDS);
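One caveat: StringIndexer writes its output column as a double. If a long is needed downstream, the index can be cast; a minimal sketch, assuming the userJoinedDataSet above:
// StringIndexer emits DoubleType; cast the generated index to long.
Dataset<Row> withLongIds = userJoinedDataSet.withColumn("userIdMapped",
        userJoinedDataSet.col("userIdMapped").cast("long"));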
Upvotes: 0
Reputation: 4948
You can try the following: assign a unique id using a built-in function, then join with the original dataset.
/**
 * Created by RGOVIND on 11/16/2016.
 */
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.*;

import java.util.ArrayList;
import java.util.List;

public class SparkUserObjectMain {
    static public void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("Stack Overflow App");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // seed the data
        List<UserObject> users = new ArrayList<UserObject>();
        UserObject user1 = new UserObject("abc13", "23", "1");
        UserObject user2 = new UserObject("qwe34", "56", "3");
        UserObject user3 = new UserObject("qwe34", "35", "4");
        users.add(user1);
        users.add(user2);
        users.add(user3);

        // encode the bean and create the user dataset
        Encoder<UserObject> userObjectEncoder = Encoders.bean(UserObject.class);
        Dataset<UserObject> usersDataSet = sqlContext.createDataset(users, userObjectEncoder);

        // assign unique ids to the distinct userIds
        Dataset<Row> uniqueUsersWithId = usersDataSet.dropDuplicates("userId")
                .select("userId")
                .withColumn("id", functions.monotonically_increasing_id());

        // join back with the original dataset on userId
        Dataset<Row> joinedDataSet = usersDataSet.join(uniqueUsersWithId, "userId");
        joinedDataSet.show();
    }
}
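The dropDuplicates("userId") step is what makes this work: ids are assigned to the distinct users first, so the join back gives every row for the same user the same id, as the output at the end shows.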
The bean:
/**
 * Created by RGOVIND on 11/16/2016.
 */
public class UserObject {
    private String userId;
    private String itemId;
    private String rating;

    public UserObject() {
    }

    public UserObject(String userId, String itemId, String rating) {
        this.userId = userId;
        this.itemId = itemId;
        this.rating = rating;
    }

    public String getUserId() {
        return userId;
    }

    public void setUserId(String userId) {
        this.userId = userId;
    }

    public String getItemId() {
        return itemId;
    }

    public void setItemId(String itemId) {
        this.itemId = itemId;
    }

    public String getRating() {
        return rating;
    }

    public void setRating(String rating) {
        this.rating = rating;
    }
}
Prints:
+------+------+------+------------+
|userId|itemId|rating| id|
+------+------+------+------------+
| abc13| 23| 1|403726925824|
| qwe34| 56| 3|901943132160|
| qwe34| 35| 4|901943132160|
+------+------+------+------------+
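Note that monotonically_increasing_id() guarantees ids that are unique and monotonically increasing, but not consecutive (the partition id is encoded in the upper bits), which is why the printed ids are large, non-sequential numbers.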
Upvotes: 1