Reputation: 6475
I'm working on a classification problem in which I have to use mllib library. The classification algorithms (let's say Logistic Regression) in mllib require an RDD[LabeledPoint]. A LabeledPoint has only two fields, a label and a feature vector. When doing the scoring (applying my trained model on the test set), my test instances have a few other fields that I'd like to keep. For example, a test instance looks like this <id, field1, field2, label, features>
. When I create an RDD of LabeledPoint all the other fields (id,field1 and field2) are gone and I can't make the relation between my scored instance and the original one. How can I solved this issue. After the scoring, I need to know the ids' and the score/predicted_label.
This problem doesn't exist in ML as it uses DataFrame and I can simply add another column with the score to my original dataframe.
Upvotes: 1
Views: 314
Reputation: 63062
A solution to your problem is that the map
method of RDD retains order; therefore, you can use the RDD.zip
method with the id's.
Here is an answer that shows the procedure
Spark MLLib Kmeans from dataframe, and back again
It's very easy to obtain pairs of ids and clusters in form of RDD:
val idPointRDD = data.rdd.map(s => (s.getInt(0),
Vectors.dense(s.getDouble(1),s.getDouble(2)))).cache()
val clusters = KMeans.train(idPointRDD.map(_._2), 3, 20)
val clustersRDD = clusters.predict(idPointRDD.map(_._2))
val idClusterRDD = idPointRDD.map(_._1).zip(clustersRDD)
Then you create DataFrame from that
val idCluster = idClusterRDD.toDF("id", "cluster")
It works because map doesn't change order of the data in RDD, which is why you can just zip ids with results of prediction.
Upvotes: 1