yokielove
yokielove

Reputation: 213

Combine multiple rows, with distinct value

I have a table like below:

ID Date         Class     Level
1  2021/01/01    math      1
1  2021/01/01    english   1
1  2021/01/01    history   1

My current code:

grouped_df = df\
.groupby('ID','Date')\
.agg(collect_list('class').alias("class"),collect_list('level').alias("level"))\
.withColumn("class", concat_ws(", ", "class"))\
.withColumn("level", concat_ws(", ", "level"))

and the code give me output look like this:

    ID Date         Class                       Level
    1  2021/01/01    math, english,history      1, 1, 1

and because I want to make the combine row sort, I use concat_ws, but the order of class is not sorted by desire, is there any way to sort it when doing the concat_ws()? I want to combine class to sort alphabetically. like English, history, math. but when I do the concat_ws, the output can be math_english,history, or history, math, English.

Is there a way to make the output like below:

  ID Date        Class                       Level
  1  2021/01/01  english,history,math        1

Upvotes: 0

Views: 194

Answers (2)

mck
mck

Reputation: 42332

You can use collect_set to remove duplicates:

grouped_df = df\
    .groupby('ID','Date')\
    .agg(collect_list('class').alias("class"),collect_set('level').alias("level"))\
    .withColumn("class", concat_ws(", ", "class"))\
    .withColumn("level", concat_ws(", ", "level"))

If there is always only one single level, you can also consider grouping by the level as well, e.g.

grouped_df = df\
    .groupby('ID','Date', 'level')\
    .agg(collect_list('class').alias("class"))\
    .withColumn("class", concat_ws(", ", "class"))

Edit: if you want to sort the array, you can use sort_array:

grouped_df = df\
    .groupby('ID','Date')\
    .agg(sort_array(collect_list('class')).alias("class"),collect_set('level').alias("level"))\
    .withColumn("class", concat_ws(", ", "class"))\
    .withColumn("level", concat_ws(", ", "level"))

Upvotes: 1

blackbishop
blackbishop

Reputation: 32640

To get unique values for level use collect_set and to order the class values you can't use array_sort with Spark 2.3, but you can use collect_list over an ordered window to get a sorted list instead of using UDF which generally leads to bad performances:

from pyspark.sql import Window
from pyspark.sql import functions as F


w = Window.partitionBy("ID", "Date").orderBy("Class")

grouped_df = df.withColumn("Class", F.collect_list("Class").over(w)) \
    .withColumn("Level", F.collect_set("Level").over(w)) \
    .groupBy("ID", "Date") \
    .agg(
    F.concat_ws(",", F.max("Class")).alias("Class"),
    F.concat_ws(",", F.max("Level")).alias("Level")
)

grouped_df.show(truncate=False)

# +---+----------+--------------------+-----+
# |ID |Date      |Class               |Level|
# +---+----------+--------------------+-----+
# |1  |2021/01/01|english,history,math|1    |
# +---+----------+--------------------+-----+

Upvotes: 0

Related Questions