Pivot and Concatenate columns in pyspark dataframe

Question

I have the dataframe below, and I need to get one row with all the marks fields concatenated with a delimiter such as a pipe.
So: PACKAGING MARKS 3|PACKAGING MARKS 2|PACKAG.....

There can be a varying number of marks records for each mid.

mid	marksId	id	index	marks
2	3	3	2	PACKAGING MARKS 3
2	3	3	1	PACKAGING MARKS 2
2	3	3	0	PACKAGING MARKS 1
2	4	4	2	PACKAGING MARKS 23
2	4	4	1	PACKAGING MARKS 22
2	4	4	0	PACKAGING MARKS 21

Thanks

bzu · Accepted Answer

Assuming you want one delimited string for each "mid", you can collect all the "marks" values with collect_list() and join them into a single string with concat_ws():

import pyspark.sql.functions as F

# Collect every "marks" value per "mid" and join them with "|".
df.groupby('mid').agg(
    F.concat_ws('|', F.collect_list('marks')).alias('marks_str')
).show(truncate=False)
