Reputation: 21
This is the logic in SQL:
coalesce(if effc_dt <= tran_dt select(max of effc_dt) , if effc_dt >= tran_dt select (min of effc_dt)).
I want similar logic in Pyspark, when effc date is lesser than tran date it will select effc date closest to tran date and if lesser date is not present it will check for greater and select effc date closest to tran date.
Input dataframe:
|id|tran_date |effc_date |
|--|-----------|-----------|
|12|2020-02-01 |2019-02-01 |
|12|2020-02-01 |2018-02-01 |
|34|2020-02-01 |2021-02-15 |
|34|2020-02-01 |2020-02-15 |
|40|2020-02-01 |2019-02-15 |
|40|2020-02-01 |2020-03-15 |
Expected Output:
|id|tran_date |effc_date |
|--|-----------|-----------|
|12|2020-02-01 |2019-02-01 |
|34|2020-02-01 |2020-02-15 |
|40|2020-02-01 |2019-02-15 |
Upvotes: 0
Views: 179
Reputation: 42352
You can order by the date difference and limit the results to 1 row:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
'rn',
F.row_number().over(
Window.partitionBy('ID')
.orderBy(F.datediff('start_date', 'end_date'))
)
).filter('rn = 1').drop('rn')
df2.show()
+---+----------+----------+
| id|start_date| end_date|
+---+----------+----------+
| 34|2021-02-01|2019-02-01|
| 12|2020-02-01|2019-02-01|
+---+----------+----------+
Upvotes: 2