sparc
sparc

Reputation: 429

Logic with dates pyspark

I am using Pyspark and I have data like this in the dataframe

enter image description here

and I want the output like this

enter image description here

The logic goes like this - from table 1 above, the first date of category B for id=1 is 08/06/2022 and the first date for category A is 13/06/2022.So, for any dates on or after 13/06/2022 should have both categories A and B. So, for 08/06/2022, there is category B only and for 13/06/2022 there is category A and B. For 24/06/2022, there is just category A in table1 but the output should have category B too as the first date of category B is 13/06/2022 and for 26/07/2022, there is just category B in table 1 but output should have both category and category B for 26/07/2022.

How do I achieve this in pyspark?

Upvotes: 0

Views: 76

Answers (1)

samkart
samkart

Reputation: 6644

# input dataframe creation    
data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['id', 'cat', 'dt']). \
    withColumn('dt', func.col('dt').cast('date'))

# required solution
data_sdf. \
    withColumn('min_dt', func.min('dt').over(wd.partitionBy('id'))). \
    withColumn('all_cats', func.collect_set('cat').over(wd.partitionBy('id'))). \
    withColumn('cat_arr', 
               func.when(func.col('min_dt') == func.col('dt'), func.array(func.col('cat'))).
               otherwise(func.col('all_cats'))
               ). \
    drop('cat', 'min_dt', 'all_cats'). \
    dropDuplicates(). \
    withColumn('cat', func.explode('cat_arr')). \
    drop('cat_arr'). \
    orderBy('id', 'dt', 'cat'). \
    show()

# +---+----------+---+
# |id |dt        |cat|
# +---+----------+---+
# |1  |2022-06-08|B  |
# |1  |2022-06-13|A  |
# |1  |2022-06-13|B  |
# |1  |2022-06-24|A  |
# |1  |2022-06-24|B  |
# +---+----------+---+

I've used a subset of the posted data. The idea of the approach is that you create an array of distinct categories and apply that to all dates except the minimum date. The minimum date will only have that row's category (not all categories). The array can then be exploded to get the desired result for all dates.

Upvotes: 1

Related Questions