Reputation: 2843
I have a pyspark dataframe with four timestamp columns (clock_in
, clock_out
, sign_in
, and sign_out
). I need a third column (duration
) that is a difference between two columns, but which two I'm taking the difference of depends on the values of the columns themselves, according to the following rules:
clock_in > clock_out
, then duration = clock_in - clock_out
sign_in > clock_out
, then duration = sign_out - clock_out
clock_in > sign_in
, then duration = clock_in - clock_out
duration = 0
I've tried defining a UDF and applying it as follows:
from pyspark.sql import *
def calc_dur(clock_in, clock_out, sign_in, sign_out):
if clock_in > clock_out:
return clock_in - clock_out
elif sign_out > clock_out:
return sign_out - clock_out
elif sign_in < clock_in:
return clock_in - sign_in
else:
return 0
dur_udf = udf(calc_dur)
df2 = df.withColumn("duration", dur_udf(df.clock_in, df.clock_out, df.sign_in, df.sign_out))
However, this fills my duration
column with Timedeltas (e.g., Timedelta: 0 days, 1740 seconds, 0 microseconds (total: 1740 seconds)
), and when I try to filter to only rows with non-zero values in the duration
column, it returns zero rows.
What is the correct way of doing this?
Upvotes: 0
Views: 3646
Reputation: 1553
Without having to use UDF (which are slower in my opinion), you could do this with strict Spark SQL syntax I believe :
The columns should be casted as long beforehand (if they are in Timestamp format). So you can add this piece of code after every F.col() :
F.col("...").cast(LongType())
Link to the good SO answer : https://stackoverflow.com/a/37058395/6434448
This code would work then I hope :
from pyspark.sql import functions as F
from pyspark.sql.types import LongType, TimestampType
df.withColumn("duration",
F.when(F.col("clock_in") > F.col("clock_out"), F.col("clock_in") - F.col("clock_out")).otherwise(
F.when(F.col("sign_out") > F.col("clock_out"), F.col("sign_out") - F.col("clock_out")).otherwise(
F.when(F.col("sign_in") < F.col("clock_in"), F.col("clock_in") - F.col("sign_in")).otherwise(F.lit(0))
))
)
Upvotes: 1