id101112
id101112

Reputation: 1042

Subtracting two date columns in PySpark Python

I am trying to subtract two columns in PySpark Dataframe in Python I have got a number of problems doing it, I have column type as timestamp, the column is date1 = 2011-01-03 13:25:59 and want to subtract this from other date column date2 = 2011-01-03 13:27:00 so I want date2 - date1 and from those dataframe columns and making a seperate timediff column, which shows difference of these both columns such as timeDiff = 00:01:01

how can I do this in PySaprk

I tried the following code:

#timeDiff = df.withColumn(('timeDiff', col(df['date2']) - col(df['date1'])))

this code didn't work

I tried doing this simple thing:

timeDiff = df['date2'] - df['date1']

this actually worked, but after that I tried to add this seperate column to the my dataframe by the following piece of code

df = df.withColumn("Duration", timeDiff)

it is having the follwing error:

Py4JJavaError: An error occurred while calling o107.withColumn.
: org.apache.spark.sql.AnalysisException: cannot resolve '(`date2` - `date1`)' due to data type mismatch: '(`date2` - `date1`)' requires (numeric or calendarinterval) type, not timestamp;;

Any one can help me with any other method or how can I resolve this error ??

Upvotes: 3

Views: 10821

Answers (2)

Prem
Prem

Reputation: 11955

from pyspark.sql.functions import unix_timestamp

#sample data
df = sc.parallelize([
    ['2011-01-03 13:25:59', '2011-01-03 13:27:00'],
    ['2011-01-03 3:25:59',  '2011-01-03 3:30:00']
]).toDF(('date1', 'date2'))

timeDiff = (unix_timestamp('date2', "yyyy-MM-dd HH:mm:ss") - unix_timestamp('date1', "yyyy-MM-dd HH:mm:ss"))
df = df.withColumn("Duration", timeDiff)
df.show()

Output is:

+-------------------+-------------------+--------+
|              date1|              date2|Duration|
+-------------------+-------------------+--------+
|2011-01-03 13:25:59|2011-01-03 13:27:00|      61|
| 2011-01-03 3:25:59| 2011-01-03 3:30:00|     241|
+-------------------+-------------------+--------+

Upvotes: 8

Jane
Jane

Reputation: 61

Agreed on above answer and thanks!

But I think maybe need to change to:

timeDiff = (unix_timestamp(F.col('date2'), "yyyy-MM-dd HH:mm:ss") - unix_timestamp(F.col('date1'), "yyyy-MM-dd HH:mm:ss"))

Given

import pyspark.sql.functions as F

Upvotes: 0

Related Questions