mwhee
mwhee

Reputation: 682

How to maintain sort order in PySpark collect_list and collect multiple lists

I want to maintain the date sort-order, using collect_list for multiple columns, all with the same date order. I'll need them in the same dataframe so I can utilize to create a time series model input. Below is a sample of the "train_data":

enter image description here

I'm using a Window with PartitionBy to ensure sort order by tuning_evnt_start_dt for each Syscode_Stn. I can create one column with this code:

from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.partitionBy('Syscode_Stn').orderBy('tuning_evnt_start_dt')

sorted_list_df = train_data
.withColumn('spp_imp_daily', F.collect_list('spp_imp_daily').over(w)
           )\
.groupBy('Syscode_Stn')\
.agg(F.max('spp_imp_daily').alias('spp_imp_daily'))

but how do I create two columns in the same new dataframe?

w = Window.partitionBy('Syscode_Stn').orderBy('tuning_evnt_start_dt')

sorted_list_df = train_data
.withColumn('spp_imp_daily',F.collect_list('spp_imp_daily').over(w))
.withColumn('MarchMadInd', F.collect_list('MarchMadInd').over(w))
.groupBy('Syscode_Stn')
.agg(F.max('spp_imp_daily').alias('spp_imp_daily')))

enter image description here

Note that MarchMadInd is not shown in the screenshot, but is included in train_data. Explanation of how I got to where I am: https://stackoverflow.com/a/49255498/8691976

Upvotes: 2

Views: 10326

Answers (1)

mwhee
mwhee

Reputation: 682

Yes, the correct way is to add successive .withColumn statements, followed by a .agg statement that removes the duplicates for each array.

    w = Window.partitionBy('Syscode_Stn').orderBy('tuning_evnt_start_dt')

    sorted_list_df = train_data.withColumn('spp_imp_daily', 
    F.collect_list('spp_imp_daily').over(w)
                                      )\
    .withColumn('MarchMadInd', F.collect_list('MarchMadInd').over(w))\

    .groupBy('Syscode_Stn')\
    .agg(F.max('spp_imp_daily').alias('spp_imp_daily'), 
     F.max('MarchMadInd').alias('MarchMadInd')
    )

Upvotes: 3

Related Questions