Reputation: 1132
After running ALS algorithm in pyspark over a dataset, I have come across a final dataframe which looks like the following
Recommendation column is array type, now I want to split this column, my final dataframe should look like this
Can anyone suggest me, which pyspark function can be used to form this dataframe?
Schema of the dataframe
root
|-- person: string (nullable = false)
|-- recommendation: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ID: string (nullable = true)
| | |-- rating: float (nullable = true)
Upvotes: 2
Views: 2659
Reputation: 214957
Assuming ID doesn't duplicate in each array, you can try the following:
import pyspark.sql.functions as f
df.withColumn('recommendation', f.explode('recommendation'))\
.withColumn('ID', f.col('recommendation').getItem('ID'))\
.withColumn('rating', f.col('recommendation').getItem('rating'))\
.groupby('person')\
.pivot('ID')\
.agg(f.first('rating')).show()
+------+---+---+---+
|person| a| b| c|
+------+---+---+---+
| xyz|0.4|0.3|0.3|
| abc|0.5|0.3|0.2|
| def|0.3|0.2|0.5|
+------+---+---+---+
Or transform with RDD:
df.rdd.map(lambda r: Row(
person=r.person, **{s.ID: s.rating for s in r.recommendation})
).toDF().show()
+------+-------------------+-------------------+-------------------+
|person| a| b| c|
+------+-------------------+-------------------+-------------------+
| abc| 0.5|0.30000001192092896|0.20000000298023224|
| def|0.30000001192092896|0.20000000298023224| 0.5|
| xyz| 0.4000000059604645|0.30000001192092896|0.30000001192092896|
+------+-------------------+-------------------+-------------------+
Upvotes: 3