Reputation: 9
I have this example dataframe:

| Date | Array_Date |
| --- | --- |
| 2020-06-25 00:00:00 | [2021-12-15 00:00:00, 2022-02-05 00:00:00] |
| 2019-11-08 00:00:00 | [2018-03-01 00:00:00, 2015-07-30 00:00:00, 2022-05-18 00:00:00] |
I want a new column containing the difference in days between Date and each element of Array_Date (an array of integer day counts).
I'm trying to get this result
| Date_diff |
| --- |
| [538, 590] |
| [645, 1562, 922] |
Thanks
Upvotes: 1
Views: 170
Reputation: 9363
You can use transform, which applies a function to each element of an array column. It was introduced in PySpark 3.1.0, so if you are using an older Spark version, you need to call Spark SQL's transform through expr instead.
PySpark 3.1.0+

```python
from pyspark.sql import functions as F

df = df.withColumn('Date_diff',
                   F.transform('Array_Date', lambda x: F.abs(F.datediff(x, F.col('Date')))))
```
PySpark < 3.1.0

```python
df = df.withColumn('Date_diff',
                   F.expr('transform(Array_Date, x -> abs(datediff(x, Date)))'))
```
Upvotes: 1
Reputation: 581
First, you need to use the explode() function on the Array_Date
column so the date-diff calculation can be done per element. After the calculation, you can group by the Date
column with collect_list() to gather the results back into an array. Like this:

```python
from pyspark.sql.functions import explode, datediff, abs, collect_list, col

# One row per element of Array_Date
df = df.withColumn('Date_diff', explode(col('Array_Date')))
# datediff accepts date/timestamp columns directly, so no unix_timestamp conversion is needed
df = df.withColumn('Date_diff', abs(datediff(col('Date_diff'), col('Date'))))
# Collect the per-element differences back into one array per Date
df = df.groupBy('Date').agg(collect_list('Date_diff').alias('Date_diff'))
```
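The explode / compute / group-back pattern above can be sketched in plain Python to show its shape (the rows below are stand-in data for the first example row, not Spark code):

```python
from datetime import date
from collections import defaultdict

# Exploded rows: one (Date, element of Array_Date) pair per row
rows = [
    (date(2020, 6, 25), date(2021, 12, 15)),
    (date(2020, 6, 25), date(2022, 2, 5)),
]

# Per-row difference, then regroup by Date (like groupBy + collect_list)
grouped = defaultdict(list)
for base, d in rows:
    grouped[base].append(abs((d - base).days))

print(dict(grouped))  # {datetime.date(2020, 6, 25): [538, 590]}
```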
Upvotes: 0