Meenu Varma

Reputation: 13

PySpark: select the effc_date closest to tran_date per id, preferring effc_date values earlier than tran_date

This is the logic in SQL:

COALESCE(MAX(CASE WHEN effc_dt <= tran_dt THEN effc_dt END), MIN(CASE WHEN effc_dt >= tran_dt THEN effc_dt END))

I want the same logic in PySpark: when an effc_date earlier than the tran_date exists, select the effc_date closest to the tran_date from below; if no earlier date is present, fall back to the effc_date closest to the tran_date from above. (A sketch for recreating the sample data is included after the expected output.)

Input dataframe:

|id|tran_date  |effc_date  |
|--|-----------|-----------|
|12|2020-02-01 |2019-02-01 |
|12|2020-02-01 |2018-02-01 |
|34|2020-02-01 |2021-02-15 |
|34|2020-02-01 |2020-02-15 |
|40|2020-02-01 |2019-02-15 |
|40|2020-02-01 |2020-03-15 |

Expected Output:

|id|tran_date  |effc_date  |
|--|-----------|-----------|
|12|2020-02-01 |2019-02-01 |
|34|2020-02-01 |2020-02-15 |
|40|2020-02-01 |2019-02-15 |
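
For reference, the input DataFrame can be built roughly like this (a minimal sketch; the dates are kept as plain ISO-formatted strings here):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the input table above
df = spark.createDataFrame(
    [(12, '2020-02-01', '2019-02-01'),
     (12, '2020-02-01', '2018-02-01'),
     (34, '2020-02-01', '2021-02-15'),
     (34, '2020-02-01', '2020-02-15'),
     (40, '2020-02-01', '2019-02-15'),
     (40, '2020-02-01', '2020-03-15')],
    ['id', 'tran_date', 'effc_date']
)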

Upvotes: 1

Views: 67

Answers (1)

mck

Reputation: 42342

You can group by id and tran_date, then do a coalesce of conditional aggregations:

import pyspark.sql.functions as F

# For each (id, tran_date) group, prefer the latest effc_date strictly before
# tran_date; if no such date exists, fall back to the earliest effc_date on or
# after tran_date.
df2 = df.groupBy('id', 'tran_date').agg(
    F.coalesce(
        F.max(F.when(F.col('effc_date') < F.col('tran_date'), F.col('effc_date'))),
        F.min(F.when(F.col('effc_date') >= F.col('tran_date'), F.col('effc_date')))
    ).alias('effc_date')
)

df2.show()
+---+----------+----------+
| id| tran_date| effc_date|
+---+----------+----------+
| 12|2020-02-01|2019-02-01|
| 34|2020-02-01|2020-02-15|
| 40|2020-02-01|2019-02-15|
+---+----------+----------+
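
Note that if the date columns are stored as strings, the comparisons above still order correctly for ISO-formatted dates, but you can cast them to dates first if you prefer, for example (a small sketch using to_date):

import pyspark.sql.functions as F

# Optional: compare proper DateType columns instead of ISO date strings.
df = df.withColumn('tran_date', F.to_date('tran_date')) \
       .withColumn('effc_date', F.to_date('effc_date'))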

Upvotes: 1
