yokielove

Reputation: 213

pivot one column into multiple columns in Pyspark/Python

I found a question similar to mine, but it uses SQL Server rather than PySpark/Python: Pivoting Multiple Columns Based On a Single Column

I have a dataset as below:

ID       Date             Class
1       2021/01/01        math, english
1       2021/01/02        math, english
1       2021/01/03        chinese
1       2021/01/04        math, chemistry
1       2021/01/05        math, english
1       2021/01/06        chinese
2       2021/01/01        PE
2       2021/01/02        math, chinese
2       2021/01/03        math, english
2       2021/01/04        math, chinese
.......

The desired output should be:

ID      Date_1          schedule_1         Date_2          schedule_2        Date_3          schedule_3
1       2021/01/01      math, english      2021/01/03      chinese           2021/01/04      math, chemistry ...
1       2021/01/02      math, english      2021/01/06      chinese ...
1       2021/01/05      math, english ...
2       2021/01/01      PE                 2021/01/02      math, chinese     2021/01/03      math, english
2                                          2021/01/04      math, chinese

I am planning to use pivot and groupBy. This is my current code, which is not working, and I have no idea how to fix it.

from pyspark.sql.functions import expr

line2 = line \
    .select("ID") \
    .groupBy("ID", "Class") \
    .pivot("Date", "Class") \
    .agg(expr("coalesce(first(Class), ' ')"))

Any help, ideas, or suggestions would be appreciated.

Upvotes: 0

Views: 242

Answers (1)

mck

Reputation: 42422

A bit tricky; it needs some more wrangling:

import pyspark.sql.functions as F
from pyspark.sql import Window

df2 = df.withColumn(
    # number each row within its (ID, Class) group, ordered by Date
    'rn',
    F.row_number().over(Window.partitionBy('ID', 'Class').orderBy('Date'))
).withColumn(
    # earliest Date seen for each (ID, Class) group
    'mindate',
    F.min('Date').over(Window.partitionBy('ID', 'Class'))
).withColumn(
    # rank each ID's Class groups by their earliest Date; this becomes
    # the column label (1, 2, 3, ...) after pivoting
    'rn2',
    F.dense_rank().over(Window.partitionBy('ID').orderBy('mindate'))
).groupBy('ID', 'rn').pivot('rn2').agg(
    # keep one (Date, Class) struct per pivoted column
    F.first(F.struct('Date', 'Class'))
).orderBy('ID', 'rn')

# star-expand each pivoted struct into its Date and Class fields
df3 = df2.select(
    'ID',
    *[f'{c}.*' for c in df2.columns[2:]]
)

df3.show(truncate=False)
+---+----------+-------------+----------+-------------+----------+---------------+
|ID |Date      |Class        |Date      |Class        |Date      |Class          |
+---+----------+-------------+----------+-------------+----------+---------------+
|1  |2021/01/01|math, english|2021/01/03|chinese      |2021/01/04|math, chemistry|
|1  |2021/01/02|math, english|2021/01/06|chinese      |null      |null           |
|1  |2021/01/05|math, english|null      |null         |null      |null           |
|2  |2021/01/01|PE           |2021/01/02|math, chinese|2021/01/03|math, english  |
|2  |null      |null         |2021/01/04|math, chinese|null      |null           |
+---+----------+-------------+----------+-------------+----------+---------------+
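
Note that the star-expansion leaves duplicated Date/Class column names. If you want names like Date_1/schedule_1 from the question, one option is to alias the struct fields instead of star-expanding; a minimal sketch building on df2 above (df3_named is just an illustrative name; the pivoted columns are labelled '1', '2', ... by rn2):

df3_named = df2.select(
    'ID',
    *[field
      for c in df2.columns[2:]  # pivoted columns: '1', '2', '3', ...
      for field in (df2[c]['Date'].alias(f'Date_{c}'),
                    df2[c]['Class'].alias(f'schedule_{c}'))]
)
df3_named.show(truncate=False)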

Upvotes: 1
