PySpark: date_diff between a timestamp column and an array-of-timestamps column

Question

I have this example DataFrame:

Date	Array_Date
2020-06-25 00:00:00	[2021-12-15 00:00:00, 2022-02-05 00:00:00]
2019-11-08 00:00:00	[2018-03-01 00:00:00, 2015-07-30 00:00:00, 2022-05-18 00:00:00]

I want to compute the difference in days between Date and each element of Array_Date, and store it in a new column (an array of integers).

I'm trying to get this result:

Date_diff
[538 , 590]
[645 , 1562 , 922]

Thanks

Emma · Accepted Answer

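You can use transform, which applies a function to each element of an array column. The Python function pyspark.sql.functions.transform was introduced in PySpark 3.1.0, so on older versions you need to call Spark SQL's transform through F.expr instead (the SQL function has been available since Spark 2.4). datediff(end, start) returns end minus start in days, so wrapping it in abs keeps the result non-negative whichever date is later.

PySpark 3.1.0+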
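To make the per-element behaviour concrete, here is a plain-Python analogue of what the transform expression computes for one row. It involves no Spark at all; the helper name abs_day_diffs is made up for illustration, and it assumes that only the calendar dates matter, as they do for datediff.

```python
from datetime import datetime


def abs_day_diffs(date, array_dates):
    """Plain-Python analogue of transform(Array_Date, x -> abs(datediff(x, Date))).

    Hypothetical helper for illustration only; it is not part of PySpark.
    """
    # Drop the time of day and keep only the calendar date, then take the
    # absolute number of whole days between each element and the reference date.
    return [abs((x.date() - date.date()).days) for x in array_dates]


row1 = abs_day_diffs(
    datetime(2020, 6, 25),
    [datetime(2021, 12, 15), datetime(2022, 2, 5)],
)
row2 = abs_day_diffs(
    datetime(2019, 11, 8),
    [datetime(2018, 3, 1), datetime(2015, 7, 30), datetime(2022, 5, 18)],
)

print(row1)  # [538, 590]
print(row2)  # [617, 1562, 922]
```

For the question's sample data, every value matches the expected output except the first element of the second row: 2018-03-01 to 2019-11-08 is 617 days, not the 645 listed in the question.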

from pyspark.sql import functions as F

df = df.withColumn('Date_diff', 
                   F.transform('Array_Date', lambda x: F.abs(F.datediff(x, F.col('Date')))))

Older versions (Spark SQL's transform, Spark 2.4+)

df = df.withColumn('Date_diff', F.expr('transform(Array_Date, x -> abs(datediff(x, Date)))'))
