Abhishek Mishra

Reputation: 21

Spark SQL - select the record having the least difference between two date columns

This is the logic in SQL:

coalesce(if effc_dt <= tran_dt then select max(effc_dt), if effc_dt >= tran_dt then select min(effc_dt))

I want similar logic in PySpark: when effc_date is less than tran_date, it should select the effc_date closest to tran_date, and if no lesser date is present, it should look among the greater dates and select the effc_date closest to tran_date (a PySpark sketch of this logic is shown after the expected output below).

Input dataframe:

|id|tran_date  |effc_date  |
|--|-----------|-----------|
|12|2020-02-01 |2019-02-01 |
|12|2020-02-01 |2018-02-01 |
|34|2020-02-01 |2021-02-15 |
|34|2020-02-01 |2020-02-15 |
|40|2020-02-01 |2019-02-15 |
|40|2020-02-01 |2020-03-15 |

Expected Output:

|id|tran_date  |effc_date  |
|--|-----------|-----------|
|12|2020-02-01 |2019-02-01 |
|34|2020-02-01 |2020-02-15 |
|40|2020-02-01 |2019-02-15 |
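
The coalesce logic described above can be expressed directly as a grouped aggregation in PySpark. This is only a sketch, assuming the dataframe is named df and the two columns are date-typed:

from pyspark.sql import functions as F

# Per (id, tran_date): take the latest effc_date on or before tran_date,
# falling back to the earliest effc_date after it when no earlier date exists.
closest = df.groupBy('id', 'tran_date').agg(
    F.coalesce(
        F.max(F.when(F.col('effc_date') <= F.col('tran_date'), F.col('effc_date'))),
        F.min(F.when(F.col('effc_date') > F.col('tran_date'), F.col('effc_date')))
    ).alias('effc_date')
)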

Upvotes: 0

Views: 179

Answers (1)

mck

Reputation: 42352

You can rank the rows within each id by how close effc_date is to tran_date, preferring dates on or before tran_date, and keep only the first row:

from pyspark.sql import functions as F, Window

# Rank each id's rows: prefer an effc_date on or before tran_date,
# then order by the smallest gap between the two dates.
w = Window.partitionBy('id').orderBy(
    (F.col('effc_date') > F.col('tran_date')).cast('int'),
    F.abs(F.datediff('tran_date', 'effc_date'))
)

df2 = df.withColumn('rn', F.row_number().over(w)).filter('rn = 1').drop('rn')

df2.orderBy('id').show()
+---+----------+----------+
| id| tran_date| effc_date|
+---+----------+----------+
| 12|2020-02-01|2019-02-01|
| 34|2020-02-01|2020-02-15|
| 40|2020-02-01|2019-02-15|
+---+----------+----------+
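
For reference, the sample dataframe from the question can be created like this (just a sketch; spark is assumed to be an existing SparkSession):

from pyspark.sql import functions as F

# Build the question's sample data and cast the date strings to date type.
df = spark.createDataFrame(
    [(12, '2020-02-01', '2019-02-01'),
     (12, '2020-02-01', '2018-02-01'),
     (34, '2020-02-01', '2021-02-15'),
     (34, '2020-02-01', '2020-02-15'),
     (40, '2020-02-01', '2019-02-15'),
     (40, '2020-02-01', '2020-03-15')],
    ['id', 'tran_date', 'effc_date']
).withColumn('tran_date', F.to_date('tran_date')) \
 .withColumn('effc_date', F.to_date('effc_date'))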

Upvotes: 2
