kakdim
kakdim

Reputation: 9

Pyspark : date_diff between timestamp and array column timestamp

I have this dataframe exemple

Date Array_Date
2020-06-25 00:00:00 [2021-12-15 00:00:00, 2022-02-05 00:00:00]
2019-11-08 00:00:00 2018-03-01 00:00:00, 2015-07-30 00:00:00, 2022-05-18 00:00:00]

I want to get the difference between Date and Array_Date in days in a new column (type array int days)

I'm trying to get this result

Date_diff
[538 , 590]
[645 , 1562 , 922]

Thanks

Upvotes: 1

Views: 170

Answers (2)

Emma
Emma

Reputation: 9363

You can use transform which apply a function to each element in an array in a column. It is introduced in 3.1.0, so if you are using older Spark, you need to use Spark SQL's transform.

Pyspark 3.1.0+

from pyspark.sql import functions as F

df = df.withColumn('Date_diff', 
                   F.transform('Array_Date', lambda x: F.abs(F.datediff(x, F.col('Date')))))

<Pyspark3.1.0

df = df.withColumn('Date_diff',
                   F.expr('transform(Array_Date, x -> abs(datediff(x, Date)))'))

Upvotes: 1

ms12
ms12

Reputation: 581

First, you need to use explode() function on the Array_Date column so the date diff calculations can be done. After the calculations, you can just group by the Date column to get the array of the calculations back. Like this:

from pyspark.sql.functions import unix_timestamp, datediff, array, col

df = df.withColumn('Date_diff', explode(col('Array_Date')))
df = df.withColumn('Date_diff', 
    datediff(
        unix_timestamp(col('Date_diff'), 'yyyy-MM-dd HH:mm:ss'), 
        unix_timestamp(col('Date'), 'yyyy-MM-dd HH:mm:ss')
    )
)

df = df.groupBy('Date').agg(array('Date_diff').alias('Date_diff'))

Upvotes: 0

Related Questions