Meenu Varma

Reputation: 13

PySpark: select the effc_date closest to tran_date per id, preferring effc_date values earlier than tran_date

This is the logic in SQL:

COALESCE(MAX(CASE WHEN effc_dt <= tran_dt THEN effc_dt END), MIN(CASE WHEN effc_dt >= tran_dt THEN effc_dt END))

I want the same logic in PySpark: when an effc_date earlier than the tran_date exists, select the effc_date closest to the tran_date from below; if no earlier date is present, fall back to the effc_date closest to the tran_date from above. (A sketch for recreating the sample data is included after the expected output.)

Input dataframe:

|id|tran_date  |effc_date  |
|--|-----------|-----------|
|12|2020-02-01 |2019-02-01 |
|12|2020-02-01 |2018-02-01 |
|34|2020-02-01 |2021-02-15 |
|34|2020-02-01 |2020-02-15 |
|40|2020-02-01 |2019-02-15 |
|40|2020-02-01 |2020-03-15 |

Expected Output:

|id|tran_date  |effc_date  |
|--|-----------|-----------|
|12|2020-02-01 |2019-02-01 |
|34|2020-02-01 |2020-02-15 |
|40|2020-02-01 |2019-02-15 |
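
For reference, the input DataFrame can be built roughly like this (a minimal sketch; the dates are kept as plain ISO-formatted strings here):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the input table above
df = spark.createDataFrame(
    [(12, '2020-02-01', '2019-02-01'),
     (12, '2020-02-01', '2018-02-01'),
     (34, '2020-02-01', '2021-02-15'),
     (34, '2020-02-01', '2020-02-15'),
     (40, '2020-02-01', '2019-02-15'),
     (40, '2020-02-01', '2020-03-15')],
    ['id', 'tran_date', 'effc_date']
)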

Upvotes: 1

Views: 67

Answers (1)

mck

Reputation: 42342

You can group by id and tran_date, then do a coalesce of conditional aggregations:

import pyspark.sql.functions as F

# For each (id, tran_date) group, prefer the latest effc_date strictly before
# tran_date; if no such date exists, fall back to the earliest effc_date on or
# after tran_date.
df2 = df.groupBy('id', 'tran_date').agg(
    F.coalesce(
        F.max(F.when(F.col('effc_date') < F.col('tran_date'), F.col('effc_date'))),
        F.min(F.when(F.col('effc_date') >= F.col('tran_date'), F.col('effc_date')))
    ).alias('effc_date')
)

df2.show()
+---+----------+----------+
| id| tran_date| effc_date|
+---+----------+----------+
| 12|2020-02-01|2019-02-01|
| 34|2020-02-01|2020-02-15|
| 40|2020-02-01|2019-02-15|
+---+----------+----------+
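
Note that if the date columns are stored as strings, the comparisons above still order correctly for ISO-formatted dates, but you can cast them to dates first if you prefer, for example (a small sketch using to_date):

import pyspark.sql.functions as F

# Optional: compare proper DateType columns instead of ISO date strings.
df = df.withColumn('tran_date', F.to_date('tran_date')) \
       .withColumn('effc_date', F.to_date('effc_date'))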

Upvotes: 1
