ajb

Reputation: 31689

Difference of two TimestampTypes in pyspark

Using pyspark, I have a data frame with two TimestampType columns:

df.schema
...StructField(session_start,TimestampType,true),StructField(session_end,TimestampType,true)...

But I can't figure out how to compute the difference:

from pyspark.sql.functions import col

df2 = df.withColumn("session_length", col("session_end") - col("session_start"))

gives me

AnalysisException: u"cannot resolve '(`session_end` - `session_start`)' due to data type mismatch: '(`session_end` - `session_start`)' requires (numeric or calendarinterval) type, not timestamp ...

I haven't found an alternative that works. (There's a datediff function, but that returns a result in days, and I need the difference in seconds.)

How should I write this?

EDIT: The original version of this question had a different error caused by reusing col as a variable in my notebook. After redoing the import to get the function back, I now get the above AnalysisException.
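For reference, the shadowing looked roughly like this (the variable name is the only real detail; the value is illustrative):

# earlier in the notebook, col had been rebound to a plain value...
col = "session_start"
# ...so calls like col("session_end") failed until the function was re-imported:
from pyspark.sql.functions import col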

Upvotes: 5

Views: 4312

Answers (1)

pault

Reputation: 43494

You should use pyspark.sql.functions.unix_timestamp to convert your columns to seconds since the epoch, then take the difference:

from pyspark.sql.functions import unix_timestamp
df2 = df.withColumn(
    "session_length",
    unix_timestamp("session_end") - unix_timestamp("session_start")
)
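As a quick sanity check, here is a minimal, self-contained run (the SparkSession setup and the sample timestamps are illustrative, not from the question). A session lasting 30 minutes and 15 seconds should come out as 1815 seconds:

from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.functions import unix_timestamp

spark = SparkSession.builder.getOrCreate()

# Sample data: one session lasting 30 minutes and 15 seconds
df = spark.createDataFrame(
    [(datetime(2018, 1, 1, 12, 0, 0), datetime(2018, 1, 1, 12, 30, 15))],
    ["session_start", "session_end"],
)

df.withColumn(
    "session_length",
    unix_timestamp("session_end") - unix_timestamp("session_start")
).show()
#+-------------------+-------------------+--------------+
#|      session_start|        session_end|session_length|
#+-------------------+-------------------+--------------+
#|2018-01-01 12:00:00|2018-01-01 12:30:15|          1815|
#+-------------------+-------------------+--------------+

Note that unix_timestamp returns whole seconds, so any sub-second part of the timestamps is dropped.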

Upvotes: 4
