Reputation: 169
I am trying to compare two timestamp columns and return the minimum of the two, and I am wondering whether there is a better way than what I have. Note that both columns might have values, or one of them might have a value while the other is null. I know about when().otherwise() (sketched below for reference).
import org.apache.spark.sql.functions._
import java.sql.Timestamp
// UDF that returns the earlier of two timestamps, treating null as "missing":
// if only one side is non-null, return that side; if both are null, return None.
val compareTime = udf((t1: Timestamp, t2: Timestamp) => {
  if (t1 != null && t2 != null && t1.before(t2)) {
    Some(t1)
  } else if (t1 != null && t2 != null && t2.before(t1)) {
    Some(t2)
  } else if (t1 != null) {
    Some(t1)
  } else if (t2 != null) {
    Some(t2)
  } else {
    None
  }
})
var df = Seq((1L, "2021-01-04 16:10:00","2021-01-04 15:20:00")).toDF("id","t1","t2")
df = (df.withColumn("t1",to_timestamp($"t1","yyyy-MM-dd HH:mm:ss"))
.withColumn("t2",to_timestamp($"t2","yyyy-MM-dd HH:mm:ss")))
df = df.withColumn("t3",compareTime($"t1",$"t2"))
df.show()
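For reference, the when().otherwise() version I was hoping to avoid would look roughly like this (a sketch of the same null-safe minimum logic):
// fall back to the other column when one side is null, otherwise take the earlier value
df = df.withColumn("t3",
  when($"t1".isNull, $"t2")
    .when($"t2".isNull, $"t1")
    .when($"t1" < $"t2", $"t1")
    .otherwise($"t2"))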
Upvotes: 0
Views: 952
Reputation: 42352
A UDF is probably unnecessary here - you can use the Spark SQL function least:
var df = Seq((1L, "2021-01-04 16:10:00","2021-01-04 15:20:00")).toDF("id","t1","t2")
df = (df.withColumn("t1",to_timestamp($"t1","yyyy-MM-dd HH:mm:ss"))
.withColumn("t2",to_timestamp($"t2","yyyy-MM-dd HH:mm:ss")))
df = df.withColumn("t3",least($"t1",$"t2"))
df.show()
+---+-------------------+-------------------+-------------------+
| id| t1| t2| t3|
+---+-------------------+-------------------+-------------------+
| 1|2021-01-04 16:10:00|2021-01-04 15:20:00|2021-01-04 15:20:00|
+---+-------------------+-------------------+-------------------+
The opposite of least is greatest, if you want to get the larger of the two columns.
Note that both least and greatest will ignore null values, but they will return null if all input columns are null.
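For example, extending the DataFrame above with null rows demonstrates that behaviour (a sketch; the Option values are just a convenient way to build nullable string columns):
var df2 = Seq(
  (1L, Some("2021-01-04 16:10:00"), Some("2021-01-04 15:20:00")),
  (2L, None, Some("2021-01-04 15:20:00")),
  (3L, None, None)
).toDF("id", "t1", "t2")
df2 = (df2.withColumn("t1", to_timestamp($"t1", "yyyy-MM-dd HH:mm:ss"))
          .withColumn("t2", to_timestamp($"t2", "yyyy-MM-dd HH:mm:ss")))
// id 2: least returns 2021-01-04 15:20:00 (the null t1 is ignored)
// id 3: least returns null (all inputs are null); greatest behaves the same way
df2.withColumn("t3", least($"t1", $"t2")).show()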
Upvotes: 1
Reputation: 51271
Try this:
(Option(t1) ++ Option(t2)).minOption
It should do the same job as your if..else if..else stack.
Oops. My bad. minOption was only added in Scala 2.13, and Spark doesn't do Scala 2.13.x. Try this instead:
util.Try((Option(t1) ++ Option(t2)).minBy(_.getTime())).toOption
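Plugged back into the original UDF, that would look something like this (a sketch, keeping the same name and signature as in the question):
val compareTime = udf((t1: Timestamp, t2: Timestamp) =>
  // Option(null) is None, so ++ keeps only the non-null side(s);
  // minBy(_.getTime()) picks the earlier timestamp; Try(...).toOption
  // turns the both-null case (minBy on an empty collection throws) into None
  util.Try((Option(t1) ++ Option(t2)).minBy(_.getTime())).toOption
)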
Upvotes: 1