BING

Reputation: 25

Pyspark, how to convert the raw data into SVMLight format

I have one question regarding the Pyspark map.

For example, I have data as follows:

 data=[(1,1,1,10),(1,1,2,20),(2,1,3,15),(2,1,1,47),(3,0,2,28),(3,0,3,17)]
 df=spark.createDataFrame(data).toDF("ID","Target","features","value1")
 df.show()

 +---+------+--------+------+
 | ID|Target|features|value1|
 +---+------+--------+------+
 |  1|     1|       1|    10|
 |  1|     1|       2|    20|
 |  2|     1|       3|    15|
 |  2|     1|       1|    47|
 |  3|     0|       2|    28|
 |  3|     0|       3|    17|
 +---+------+--------+------+

I want to convert the data, grouped by ID, to look like this:

 1 1:10  2:20
 1 3:15  1:47
 0 2:28  3:17

So each line represents one ID: the first value is the Target, followed by the features:value1 pairs.

Could you provide any sample code or suggestions?

Thank you so much!

Upvotes: 1

Views: 259

Answers (1)

werner

Reputation: 14845

You can group the data by ID (and maybe also by Target?), collect each group into a list, and then use a combination of transform and concat_ws to format each list into the required output:

from pyspark.sql import functions as F

df = (
    spark.createDataFrame(data).toDF("ID", "Target", "features", "value1")
    # collect the (features, value1) pairs of each (ID, Target) group into a list
    .groupBy("ID", "Target")
    .agg(F.collect_list(F.struct("features", "value1")).alias("feature_value"))
    # format each pair as "features:value1" and join the pairs with spaces
    .withColumn("feature_value",
                F.expr("transform(feature_value, x -> concat_ws(':', x.features, x.value1))"))
    .withColumn("feature_value", F.concat_ws(" ", F.col("feature_value")))
    # prepend the Target value
    .withColumn("result", F.concat_ws(" ", F.col("Target"), F.col("feature_value")))
    .select("result")
)

Result:

+-----------+
|     result|
+-----------+
|0 2:28 3:17|
|1 1:10 2:20|
|1 3:15 1:47|
+-----------+
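For reference, the same grouping-and-formatting logic can be sketched in plain Python (no Spark), which is handy for sanity-checking the expected SVMLight lines on a small sample; the helper name to_svmlight_lines is my own:

```python
from itertools import groupby

def to_svmlight_lines(rows):
    # rows: (ID, Target, feature, value) tuples, assumed sorted by ID
    lines = []
    for (_id, target), group in groupby(rows, key=lambda r: (r[0], r[1])):
        # format each (feature, value) pair as "feature:value", space-separated
        pairs = " ".join(f"{f}:{v}" for _, _, f, v in group)
        # prepend the Target value
        lines.append(f"{target} {pairs}")
    return lines

data = [(1, 1, 1, 10), (1, 1, 2, 20), (2, 1, 3, 15),
        (2, 1, 1, 47), (3, 0, 2, 28), (3, 0, 3, 17)]
print(to_svmlight_lines(data))
# ['1 1:10 2:20', '1 3:15 1:47', '0 2:28 3:17']
```

Note this is only a local sketch of the formatting; for large data you would stay with the Spark version above.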

Upvotes: 1
