Reputation: 11927
I have a PySpark dataframe like this,
+----------+------------+------------+------------+
| Name | dateCol1 | dateCol2 | dateCol3 |
+----------+------------+------------+------------+
| user1 | 2018-01-01 | 2018-01-10 | 2018-01-01 |
| user1 | 2018-01-11 | 2018-01-20 | 2018-01-01 |
| user2 | 2018-01-11 | 2018-01-20 | 2018-01-11 |
| user1 | 2019-01-21 | 2018-01-30 | 2018-01-01 |
+----------+------------+------------+------------+
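For reproducibility, a minimal sketch that builds this sample dataframe (assuming an active SparkSession named spark; the dates are kept as plain strings for simplicity):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    ('user1', '2018-01-01', '2018-01-10', '2018-01-01'),
    ('user1', '2018-01-11', '2018-01-20', '2018-01-01'),
    ('user2', '2018-01-11', '2018-01-20', '2018-01-11'),
    ('user1', '2019-01-21', '2018-01-30', '2018-01-01'),
]
spark_df = spark.createDataFrame(data, ['Name', 'dateCol1', 'dateCol2', 'dateCol3'])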
I want to group this dataframe by the keys dateCol1 and dateCol2 and do a collect_list over the column Name. For that I'm using the code:
spark_df.groupBy('dateCol1', 'dateCol2').agg(F.collect_list('Name'))
While collecting the column to a list, I also want to maintain the order of the values based on the column dateCol3.
For instance, for dateCol1 == '2018-01-11' and dateCol2 == '2018-01-20', I want to make sure that collecting to a list always gives [user1, user2] (based on the order of dateCol3).
The required output for the dataframe is:
+------------+------------+----------------+
| dateCol1 | dateCol2 | List |
+------------+------------+----------------+
| 2018-01-01 | 2018-01-10 | [user1] |
| 2018-01-11 | 2018-01-20 | [user1, user2] |
| 2019-01-21 | 2018-01-30 | [user1] |
+------------+------------+----------------+
collect_list won't maintain order by default. How do I make sure that the collected list is ordered based on another column of the dataframe?
Upvotes: 1
Views: 4627
Reputation: 13397
You can try:
spark_df.orderBy('dateCol3', ascending=True).groupBy('dateCol1', 'dateCol2').agg(F.collect_list('Name'))
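Expanded into a self-contained sketch (the import and the List alias are added here only to mirror the required output; they are not part of the approach itself):

from pyspark.sql import functions as F

result = (spark_df
          .orderBy('dateCol3', ascending=True)
          .groupBy('dateCol1', 'dateCol2')
          .agg(F.collect_list('Name').alias('List')))
result.show(truncate=False)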
Alternatively, although it would be a bit of overkill, you can use windowing:
from pyspark.sql import functions as F
from pyspark.sql import Window as w

# Use an explicit frame: with only orderBy, the default frame is
# "unbounded preceding to current row", which collects running lists
# instead of the full list per group.
win = (w.partitionBy('dateCol1', 'dateCol2').orderBy(F.col('dateCol3'))
       .rowsBetween(w.unboundedPreceding, w.unboundedFollowing))
spark_df.select('dateCol1', 'dateCol2',
                F.collect_list('Name').over(win).alias('Name')).distinct()
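With the frame widened to the whole partition, every row of a group carries the same full list ordered by dateCol3, so the trailing .distinct() collapses them to one row per (dateCol1, dateCol2) pair. A quick check against the sample data, reusing the win spec above (the List alias is used here only to match the required output):

(spark_df.select('dateCol1', 'dateCol2',
                 F.collect_list('Name').over(win).alias('List'))
         .distinct()
         .show(truncate=False))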
Upvotes: 1