Reputation: 1256
I have a pyspark.sql.DataFrame of the form
[[patient1, visit1, code1],
[patient1, visit1, code2],
[patient1, visit2, code3],
[patient1, visit2, code4]]
I'm trying to turn it into another DataFrame using structs:
[[patient1, [[visit1, [code1, code2]],
             [visit2, [code3, code4]]]]]
What is the best way to do this?
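For reference, a minimal reproducible version of the input (the column names are just my placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [('patient1', 'visit1', 'code1'),
     ('patient1', 'visit1', 'code2'),
     ('patient1', 'visit2', 'code3'),
     ('patient1', 'visit2', 'code4')],
    ['patient', 'visit', 'code']
)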
Upvotes: 0
Views: 290
Reputation: 13387
Assuming column names patient, visit, code, you can do:
import pyspark.sql.functions as f

res = (
    df
    # collect the codes for each (patient, visit) pair
    .groupBy(f.col('patient'), f.col('visit'))
    .agg(f.collect_list(f.col('code')).alias('code'))
    # pack each visit together with its code list into a struct
    .select(
        f.col('patient'),
        f.struct('visit', 'code').alias('_merged')
    )
    # collect all visit structs per patient
    .groupBy(f.col('patient'))
    .agg(f.collect_list(f.col('_merged')).alias('_merged'))
)
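A quick way to check the shape, assuming res above was built on the sample rows from the question (the display format varies by Spark version, and collect_list does not guarantee the order inside the arrays):

res.show(truncate=False)
# prints something like:
# +--------+----------------------------------------------------+
# |patient |_merged                                             |
# +--------+----------------------------------------------------+
# |patient1|[{visit1, [code1, code2]}, {visit2, [code3, code4]}]|
# +--------+----------------------------------------------------+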
Upvotes: 1