how to keep records information when working in Mllib

Question

I'm working on a classification problem in which I have to use mllib library. The classification algorithms (let's say Logistic Regression) in mllib require an RDD[LabeledPoint]. A LabeledPoint has only two fields, a label and a feature vector. When doing the scoring (applying my trained model on the test set), my test instances have a few other fields that I'd like to keep. For example, a test instance looks like this . When I create an RDD of LabeledPoint all the other fields (id,field1 and field2) are gone and I can't make the relation between my scored instance and the original one. How can I solved this issue. After the scoring, I need to know the ids' and the score/predicted_label.

This problem doesn't exist in ML as it uses DataFrame and I can simply add another column with the score to my original dataframe.

how to keep records information when working in Mllib

Answers (1)

Related Questions