Ben Caine

Reputation: 1256

Multiple levels of aggregation

I have a pyspark.sql.DataFrame of the form

[[patient1, visit1, code1],
 [patient1, visit1, code2],
 [patient1, visit2, code3],
 [patient1, visit2, code4]]

I'm trying to turn it into another DataFrame, using structs:

[[patient1, [[visit1, [code1, code2]],
             [visit2, [code3, code4]]]]]

What is the best way to do this?

Upvotes: 0

Views: 290

Answers (1)

Georgina Skibinski

Reputation: 13387

Assuming the column names are patient, visit, and code, you can do:

import pyspark.sql.functions as f

res = (df
    # first level: collect all codes for each (patient, visit) pair
    .groupBy(
        f.col('patient'),
        f.col('visit')
    )
    .agg(
        f.collect_list(f.col('code')).alias('code')
    )
    # pack each visit together with its collected codes into a struct
    .select(
        f.col('patient'),
        f.struct('visit', 'code').alias('_merged')
    )
    # second level: collect all (visit, codes) structs for each patient
    .groupBy(
        f.col('patient')
    ).agg(
        f.collect_list(f.col('_merged')).alias('_merged')
    )
)
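
For reference, here is a minimal, self-contained sketch of the same two-level aggregation run against toy data matching the question; the SparkSession setup and the column names patient, visit, and code are assumptions:

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

# Assumed setup: a local SparkSession and toy rows matching the question's shape
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("patient1", "visit1", "code1"),
     ("patient1", "visit1", "code2"),
     ("patient1", "visit2", "code3"),
     ("patient1", "visit2", "code4")],
    ["patient", "visit", "code"],
)

# Same two-level aggregation: codes per visit, then (visit, codes) structs per patient
res = (df
    .groupBy('patient', 'visit')
    .agg(f.collect_list('code').alias('code'))
    .select('patient', f.struct('visit', 'code').alias('_merged'))
    .groupBy('patient')
    .agg(f.collect_list('_merged').alias('_merged'))
)

# Each patient row now carries an array of (visit, codes) structs, roughly:
# patient1 -> [{visit1, [code1, code2]}, {visit2, [code3, code4]}]
res.show(truncate=False)

Note that collect_list does not guarantee element order, so the visits and codes may come back in a different order than they appear in the source data.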

Upvotes: 1
