Reputation: 213
I have a table like below:
ID  Date        Class    Level
1   2021/01/01  math     1
1   2021/01/01  english  1
1   2021/01/01  history  1
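For reference, a minimal way to reproduce this DataFrame (a sketch assuming an existing SparkSession; level is typed as a string here so the concat_ws calls below can join it directly):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; column names match the code below.
df = spark.createDataFrame(
    [(1, "2021/01/01", "math", "1"),
     (1, "2021/01/01", "english", "1"),
     (1, "2021/01/01", "history", "1")],
    ["ID", "Date", "class", "level"],
)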
My current code:
from pyspark.sql.functions import collect_list, concat_ws

grouped_df = df \
    .groupby('ID', 'Date') \
    .agg(collect_list('class').alias("class"), collect_list('level').alias("level")) \
    .withColumn("class", concat_ws(", ", "class")) \
    .withColumn("level", concat_ws(", ", "level"))
and the code gives me output that looks like this:
ID  Date        Class                   Level
1   2021/01/01  math, english, history  1, 1, 1
Because I want the combined row sorted, I used concat_ws, but the classes are not in the order I want. Is there any way to sort them when doing the concat_ws()? I want the combined class values sorted alphabetically, like english, history, math, but with concat_ws the output can come out as math, english, history or history, math, english.
Is there a way to make the output look like this?
ID  Date        Class                 Level
1   2021/01/01  english,history,math  1
Upvotes: 0
Views: 194
Reputation: 42332
You can use collect_set to remove duplicates:
from pyspark.sql.functions import collect_list, collect_set, concat_ws

grouped_df = df \
    .groupby('ID', 'Date') \
    .agg(collect_list('class').alias("class"), collect_set('level').alias("level")) \
    .withColumn("class", concat_ws(", ", "class")) \
    .withColumn("level", concat_ws(", ", "level"))
If there is always only one level, you can also consider grouping by level as well, e.g.
grouped_df = df \
    .groupby('ID', 'Date', 'level') \
    .agg(collect_list('class').alias("class")) \
    .withColumn("class", concat_ws(", ", "class"))
Edit: if you want to sort the array, you can use sort_array:
from pyspark.sql.functions import sort_array

grouped_df = df \
    .groupby('ID', 'Date') \
    .agg(sort_array(collect_list('class')).alias("class"),
         collect_set('level').alias("level")) \
    .withColumn("class", concat_ws(", ", "class")) \
    .withColumn("level", concat_ws(", ", "level"))
Upvotes: 1
Reputation: 32640
To get unique values for level, use collect_set. To order the class values, you can't use array_sort with Spark 2.3 (it was only added in Spark 2.4), but you can use collect_list over an ordered window to get a sorted list instead of using a UDF, which generally leads to poor performance:
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("ID", "Date").orderBy("Class")

grouped_df = df.withColumn("Class", F.collect_list("Class").over(w)) \
    .withColumn("Level", F.collect_set("Level").over(w)) \
    .groupBy("ID", "Date") \
    .agg(
        F.concat_ws(",", F.max("Class")).alias("Class"),
        F.concat_ws(",", F.max("Level")).alias("Level")
    )
grouped_df.show(truncate=False)
# +---+----------+--------------------+-----+
# |ID |Date |Class |Level|
# +---+----------+--------------------+-----+
# |1 |2021/01/01|english,history,math|1 |
# +---+----------+--------------------+-----+
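Note that F.max works here because the ordered window uses the default frame (unbounded preceding to the current row), so the collected array grows row by row; each partial array is a prefix of the final sorted one, and the complete array compares greatest.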
Upvotes: 0