Groupby and collect_list maintaining order based on another column in PySpark

Question

I have a PySpark dataframe like this,

+----------+------------+------------+------------+
|   Name   |  dateCol1  |  dateCol2  |  dateCol3  |
+----------+------------+------------+------------+
| user1    | 2018-01-01 | 2018-01-10 | 2018-01-01 |
| user1    | 2018-01-11 | 2018-01-20 | 2018-01-01 |
| user2    | 2018-01-11 | 2018-01-20 | 2018-01-11 |
| user1    | 2019-01-21 | 2018-01-30 | 2018-01-01 |
+----------+------------+------------+------------+

I want to groupby this dataset on keys, dateCol1 and dateCol2 and so a collect_list over the column Name. For that I'm using the code,

spark_df.groupBy('dateCol1', 'dateCol2').agg(F.collect_list('Name'))

While collecting the column to list, I also want to maintain the order of the values based on the column dateCol3.

For instance, I want to make sure that for the dateCol1 == '2018-01-11' and dateCol2 == '2018-01-20', collecting to list I will always get [user1, user2] (based on the order of dateCol3).

The required output for the dataframe is,

+------------+------------+----------------+
|  dateCol1  |  dateCol2  |    List        |
+------------+------------+----------------+
| 2018-01-01 | 2018-01-10 | [user1]        |
| 2018-01-11 | 2018-01-20 | [user1, user2] |
| 2019-01-21 | 2018-01-30 | [user1]        |
+------------+------------+----------------+

collect_list won't maintain order by default, How make sure that the collected list is ordered based on another external column from the dataframe?

Georgina Skibinski · Accepted Answer

You can try:

spark_df.orderBy('dateCol3', ascending=True).groupBy('dateCol1', 'dateCol2').agg(F.collect_list('Name'))

Alternatively, although it would be a bit of overkill you can use windowing:

from pyspark.sql import Window as w

spark_df.select('dateCol1', 'dateCol2', F.collect_list('Name').over(w.partitionBy(['dateCol1','dateCol2']).orderBy(F.col('dateCol3'))).alias('Name')).distinct()

Groupby and collect_list maintaining order based on another column in PySpark

Answers (1)

Related Questions