Tim
Tim

Reputation: 2843

Pyspark timestamp difference based on column values

I have a pyspark dataframe with four timestamp columns (clock_in, clock_out, sign_in, and sign_out). I need a third column (duration) that is a difference between two columns, but which two I'm taking the difference of depends on the values of the columns themselves, according to the following rules:

  1. If clock_in > clock_out, then duration = clock_in - clock_out
  2. Else, if sign_in > clock_out, then duration = sign_out - clock_out
  3. Else, if clock_in > sign_in, then duration = clock_in - clock_out
  4. Otherwise, duration = 0

I've tried defining a UDF and applying it as follows:

from pyspark.sql import *

def calc_dur(clock_in, clock_out, sign_in, sign_out):
    if clock_in > clock_out:
        return clock_in - clock_out
    elif sign_out > clock_out:
        return sign_out - clock_out
    elif sign_in < clock_in:
        return clock_in - sign_in
    else:
        return 0

dur_udf = udf(calc_dur)

df2 = df.withColumn("duration", dur_udf(df.clock_in, df.clock_out, df.sign_in, df.sign_out))

However, this fills my duration column with Timedeltas (e.g., Timedelta: 0 days, 1740 seconds, 0 microseconds (total: 1740 seconds)), and when I try to filter to only rows with non-zero values in the duration column, it returns zero rows.

What is the correct way of doing this?

Upvotes: 0

Views: 3646

Answers (1)

tricky
tricky

Reputation: 1553

Without having to use UDF (which are slower in my opinion), you could do this with strict Spark SQL syntax I believe :

The columns should be casted as long beforehand (if they are in Timestamp format). So you can add this piece of code after every F.col() :

F.col("...").cast(LongType())

Link to the good SO answer : https://stackoverflow.com/a/37058395/6434448

This code would work then I hope :

from pyspark.sql import functions as F
from pyspark.sql.types import LongType, TimestampType

df.withColumn("duration",
              F.when(F.col("clock_in") > F.col("clock_out"), F.col("clock_in") - F.col("clock_out")).otherwise(
              F.when(F.col("sign_out") > F.col("clock_out"), F.col("sign_out") - F.col("clock_out")).otherwise(
              F.when(F.col("sign_in") < F.col("clock_in"), F.col("clock_in") - F.col("sign_in")).otherwise(F.lit(0))
     ))
)

Upvotes: 1

Related Questions