This is the logic in SQL (pseudocode):
coalesce(if effc_dt <= tran_dt select(max of effc_dt), if effc_dt >= tran_dt select(min of effc_dt))
I want similar logic in PySpark: when an effc_date earlier than tran_date exists, select the effc_date closest to tran_date; if no earlier date is present, fall back to the later dates and select the effc_date closest to tran_date.
Input dataframe:
|id|tran_date |effc_date |
|--|-----------|-----------|
|12|2020-02-01 |2019-02-01 |
|12|2020-02-01 |2018-02-01 |
|34|2020-02-01 |2021-02-15 |
|34|2020-02-01 |2020-02-15 |
|40|2020-02-01 |2019-02-15 |
|40|2020-02-01 |2020-03-15 |
Expected Output:
|id|tran_date |effc_date |
|--|-----------|-----------|
|12|2020-02-01 |2019-02-01 |
|34|2020-02-01 |2020-02-15 |
|40|2020-02-01 |2019-02-15 |
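For reference, the input above can be reproduced with something like the following (a minimal sketch, assuming an active SparkSession named spark; to_date casts the string columns so the date comparisons in the answer behave as real date comparisons):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Sample input matching the table above
df = spark.createDataFrame(
    [
        (12, '2020-02-01', '2019-02-01'),
        (12, '2020-02-01', '2018-02-01'),
        (34, '2020-02-01', '2021-02-15'),
        (34, '2020-02-01', '2020-02-15'),
        (40, '2020-02-01', '2019-02-15'),
        (40, '2020-02-01', '2020-03-15'),
    ],
    ['id', 'tran_date', 'effc_date'],
)

# Cast the string dates to DateType
df = df.withColumn('tran_date', F.to_date('tran_date')).withColumn('effc_date', F.to_date('effc_date'))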
You can group by id and tran_date, then take a coalesce of two conditional aggregations. F.when without an otherwise returns null for rows that don't satisfy the condition, so each aggregate only considers the relevant rows; when no earlier effc_date exists, the first aggregate is null and coalesce falls back to the second:
import pyspark.sql.functions as F

df2 = df.groupBy('id', 'tran_date').agg(
    F.coalesce(
        # latest effc_date strictly before tran_date
        F.max(F.when(F.col('effc_date') < F.col('tran_date'), F.col('effc_date'))),
        # fallback: earliest effc_date on or after tran_date
        F.min(F.when(F.col('effc_date') >= F.col('tran_date'), F.col('effc_date')))
    ).alias('effc_date')
)
df2.show()
+---+----------+----------+
| id| tran_date| effc_date|
+---+----------+----------+
| 12|2020-02-01|2019-02-01|
| 34|2020-02-01|2020-02-15|
| 40|2020-02-01|2019-02-15|
+---+----------+----------+
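If you prefer staying closer to the original SQL, the same aggregation can also be written as a Spark SQL query. This is a sketch under the assumption that the SparkSession is available as spark and the input DataFrame is the df used above:

df.createOrReplaceTempView('t')
df2 = spark.sql("""
    SELECT id,
           tran_date,
           -- latest earlier effc_date, otherwise earliest later-or-equal effc_date
           coalesce(
               max(CASE WHEN effc_date < tran_date THEN effc_date END),
               min(CASE WHEN effc_date >= tran_date THEN effc_date END)
           ) AS effc_date
    FROM t
    GROUP BY id, tran_date
""")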