Reputation: 1042
I am trying to subtract two columns in a PySpark DataFrame in Python and have run into a number of problems doing it. The columns are of timestamp type: date1 = 2011-01-03 13:25:59,
and I want to subtract it from another date column, date2 = 2011-01-03 13:27:00,
i.e. I want date2 - date1,
putting the result of those two DataFrame columns into a separate timeDiff column that shows the difference between them, such as timeDiff = 00:01:01.
How can I do this in PySpark?
I tried the following code:
timeDiff = df.withColumn('timeDiff', col(df['date2']) - col(df['date1']))
but this code didn't work.
Then I tried this simple thing:
timeDiff = df['date2'] - df['date1']
which actually worked, but when I then tried to add this as a separate column to my DataFrame with the following piece of code:
df = df.withColumn("Duration", timeDiff)
I got the following error:
Py4JJavaError: An error occurred while calling o107.withColumn.
: org.apache.spark.sql.AnalysisException: cannot resolve '(`date2` - `date1`)' due to data type mismatch: '(`date2` - `date1`)' requires (numeric or calendarinterval) type, not timestamp;;
Can anyone help me with another method, or with how to resolve this error?
Upvotes: 3
Views: 10821
Reputation: 11955
from pyspark.sql.functions import unix_timestamp

# sample data
df = sc.parallelize([
    ['2011-01-03 13:25:59', '2011-01-03 13:27:00'],
    ['2011-01-03 3:25:59', '2011-01-03 3:30:00']
]).toDF(('date1', 'date2'))

# unix_timestamp parses each string into seconds since the epoch,
# so their difference is the duration in seconds
timeDiff = (unix_timestamp('date2', "yyyy-MM-dd HH:mm:ss")
            - unix_timestamp('date1', "yyyy-MM-dd HH:mm:ss"))
df = df.withColumn("Duration", timeDiff)
df.show()
Output is:
+-------------------+-------------------+--------+
| date1| date2|Duration|
+-------------------+-------------------+--------+
|2011-01-03 13:25:59|2011-01-03 13:27:00| 61|
| 2011-01-03 3:25:59| 2011-01-03 3:30:00| 241|
+-------------------+-------------------+--------+
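Note that Duration here is in seconds. If, as in the question, you want an HH:mm:ss string such as 00:01:01, one way (a minimal sketch, not part of the answer above) is to format the seconds yourself with format_string:

from pyspark.sql.functions import col, format_string

# split Duration (seconds) into hours, minutes, seconds, zero-padding each part
df = df.withColumn(
    "timeDiff",
    format_string(
        "%02d:%02d:%02d",
        (col("Duration") / 3600).cast("int"),         # whole hours
        ((col("Duration") % 3600) / 60).cast("int"),  # remaining minutes
        (col("Duration") % 60).cast("int")            # remaining seconds
    )
)

Also, the sample data here uses strings; if your real columns are already of timestamp type, as in the question, you can skip the parsing and subtract long casts directly, e.g. col('date2').cast('long') - col('date1').cast('long').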
Upvotes: 8
Reputation: 61
I agree with the answer above, and thanks!
But I think it may need to be changed to:
timeDiff = (unix_timestamp(F.col('date2'), "yyyy-MM-dd HH:mm:ss") - unix_timestamp(F.col('date1'), "yyyy-MM-dd HH:mm:ss"))
given:
import pyspark.sql.functions as F
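For reference, a fully consistent version of this fix might look like the following small sketch, reusing the df from the accepted answer and referencing both functions through the F alias:

import pyspark.sql.functions as F

# same computation as the accepted answer, with explicit F.col references
timeDiff = (F.unix_timestamp(F.col('date2'), "yyyy-MM-dd HH:mm:ss")
            - F.unix_timestamp(F.col('date1'), "yyyy-MM-dd HH:mm:ss"))
df = df.withColumn("Duration", timeDiff)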
Upvotes: 0