BING

Reputation: 25

Pyspark, how to convert the raw data into SVMLight format

I have one question regarding the Pyspark map.

For example, I have data as follows:

 data=[(1,1,1,10),(1,1,2,20),(2,1,3,15),(2,1,1,47),(3,0,2,28),(3,0,3,17)]
 df=spark.createDataFrame(data).toDF("ID","Target","features","value1")
 df.show()

 +---+------+--------+------+
 | ID|Target|features|value1|
 +---+------+--------+------+
 |  1|     1|       1|    10|
 |  1|     1|       2|    20|
 |  2|     1|       3|    15|
 |  2|     1|       1|    47|
 |  3|     0|       2|    28|
 |  3|     0|       3|    17|
 +---+------+--------+------+

I want to convert the data, grouped by ID, to look like this:

 1 1:10  2:20
 1 3:15  1:47
 0 2:28  3:17

So each line represents one ID: the first value is the Target, followed by the features:value1 pairs.

Could you provide any sample code or suggestions?

Thank you so much!

Upvotes: 1

Views: 259

Answers (1)

werner

Reputation: 14845

You can group the data by ID (and maybe also by Target?), collect each group into a list, and then use a combination of transform and concat_ws to format each list into the required output:

from pyspark.sql import functions as F

df = (
    spark.createDataFrame(data).toDF("ID", "Target", "features", "value1")
    # collect the (features, value1) pairs of each (ID, Target) group into a list
    .groupBy("ID", "Target")
    .agg(F.collect_list(F.struct("features", "value1")).alias("feature_value"))
    # format each pair as "features:value1" and join the pairs with spaces
    .withColumn("feature_value",
                F.expr("transform(feature_value, x -> concat_ws(':', x.features, x.value1))"))
    .withColumn("feature_value", F.concat_ws(" ", F.col("feature_value")))
    # prepend the Target value
    .withColumn("result", F.concat_ws(" ", F.col("Target"), F.col("feature_value")))
    .select("result")
)

Result:

+-----------+
|     result|
+-----------+
|0 2:28 3:17|
|1 1:10 2:20|
|1 3:15 1:47|
+-----------+
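For reference, the same grouping-and-formatting logic can be sketched in plain Python (no Spark), which is handy for sanity-checking the expected SVMLight lines on a small sample; the helper name to_svmlight_lines is my own:

```python
from itertools import groupby

def to_svmlight_lines(rows):
    # rows: (ID, Target, feature, value) tuples, assumed sorted by ID
    lines = []
    for (_id, target), group in groupby(rows, key=lambda r: (r[0], r[1])):
        # format each (feature, value) pair as "feature:value", space-separated
        pairs = " ".join(f"{f}:{v}" for _, _, f, v in group)
        # prepend the Target value
        lines.append(f"{target} {pairs}")
    return lines

data = [(1, 1, 1, 10), (1, 1, 2, 20), (2, 1, 3, 15),
        (2, 1, 1, 47), (3, 0, 2, 28), (3, 0, 3, 17)]
print(to_svmlight_lines(data))
# ['1 1:10 2:20', '1 3:15 1:47', '0 2:28 3:17']
```

Note this is only a local sketch of the formatting; for large data you would stay with the Spark version above.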

Upvotes: 1
