habibalsaki

Reputation: 1132

pyspark split array type column to multiple columns

After running the ALS algorithm in pyspark over a dataset, I have a final dataframe that looks like the following:

[image: dataframe with a person column and an array-typed recommendation column]

The recommendation column is of array type. I want to split this column so that my final dataframe looks like this:

[image: desired dataframe with one rating column per ID]

Can anyone suggest which pyspark function can be used to form this dataframe?

Schema of the dataframe:

root
 |-- person: string (nullable = false)
 |-- recommendation: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- ID: string (nullable = true)
 |    |    |-- rating: float (nullable = true)

Upvotes: 2

Views: 2659

Answers (1)

akuiper

Reputation: 214957

Assuming ID doesn't duplicate within each array, you can try the following:

import pyspark.sql.functions as f

# explode turns each array element into its own row; getItem pulls the
# struct fields out; pivot then reshapes to one column per distinct ID
df.withColumn('recommendation', f.explode('recommendation'))\
    .withColumn('ID', f.col('recommendation').getItem('ID'))\
    .withColumn('rating', f.col('recommendation').getItem('rating'))\
    .groupby('person')\
    .pivot('ID')\
    .agg(f.first('rating')).show()

+------+---+---+---+
|person|  a|  b|  c|
+------+---+---+---+
|   xyz|0.4|0.3|0.3|
|   abc|0.5|0.3|0.2|
|   def|0.3|0.2|0.5|
+------+---+---+---+
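If the explode/pivot step is unfamiliar: it is a long-to-wide reshape. A minimal plain-Python sketch of the same reshape, on hypothetical data and without any Spark dependency, might look like:

```python
# Long-format rows, as produced by explode: (person, ID, rating)
rows = [
    ("xyz", "a", 0.4), ("xyz", "b", 0.3), ("xyz", "c", 0.3),
    ("abc", "a", 0.5), ("abc", "b", 0.3), ("abc", "c", 0.2),
]

# Equivalent of groupby('person').pivot('ID').agg(first('rating')):
# one dict per person, keyed by ID, keeping the first rating seen.
wide = {}
for person, id_, rating in rows:
    wide.setdefault(person, {}).setdefault(id_, rating)

print(wide["xyz"])  # {'a': 0.4, 'b': 0.3, 'c': 0.3}
```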

Or transform with an RDD:

from pyspark.sql import Row

# Build one Row per person, unpacking the {ID: rating} pairs as columns
df.rdd.map(lambda r: Row(
    person=r.person, **{s.ID: s.rating for s in r.recommendation})
).toDF().show()

+------+-------------------+-------------------+-------------------+
|person|                  a|                  b|                  c|
+------+-------------------+-------------------+-------------------+
|   abc|                0.5|0.30000001192092896|0.20000000298023224|
|   def|0.30000001192092896|0.20000000298023224|                0.5|
|   xyz| 0.4000000059604645|0.30000001192092896|0.30000001192092896|
+------+-------------------+-------------------+-------------------+
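The trailing digits (0.30000001192092896 and so on) are not a bug in the transform: the schema stores rating as a 32-bit float, and Python prints the nearest 64-bit double. A small sketch reproducing the artifact and one way to tidy it for display (rounding; in Spark itself, `f.round` on the column would do the same):

```python
import struct

# Round-trip 0.3 through a 32-bit float, as Spark's FloatType stores it.
f32 = struct.unpack('f', struct.pack('f', 0.3))[0]
print(f32)            # 0.30000001192092896

# Rounding recovers a tidy display value.
print(round(f32, 4))  # 0.3
```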

Upvotes: 3
