Reputation: 169
I am trying to compare two timestamp columns and return the minimum of the two, and I am wondering whether there is a better way than what I have. Note that both columns might have values, or one of them might have a value while the other is null. I know about when().otherwise() (sketched below for reference).
import org.apache.spark.sql.functions._
import java.sql.Timestamp
// UDF that returns the earlier of two timestamps, treating null as "missing":
// if only one side is non-null, return that side; if both are null, return None.
val compareTime = udf((t1: Timestamp, t2: Timestamp) => {
  if (t1 != null && t2 != null && t1.before(t2)) {
    Some(t1)
  } else if (t1 != null && t2 != null && t2.before(t1)) {
    Some(t2)
  } else if (t1 != null) {
    Some(t1)
  } else if (t2 != null) {
    Some(t2)
  } else {
    None
  }
})
var df = Seq((1L, "2021-01-04 16:10:00","2021-01-04 15:20:00")).toDF("id","t1","t2")
df = (df.withColumn("t1",to_timestamp($"t1","yyyy-MM-dd HH:mm:ss"))
.withColumn("t2",to_timestamp($"t2","yyyy-MM-dd HH:mm:ss")))
df = df.withColumn("t3",compareTime($"t1",$"t2"))
df.show()
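For reference, the when().otherwise() version I was hoping to avoid would look roughly like this (a sketch of the same null-safe minimum logic):
// fall back to the other column when one side is null, otherwise take the earlier value
df = df.withColumn("t3",
  when($"t1".isNull, $"t2")
    .when($"t2".isNull, $"t1")
    .when($"t1" < $"t2", $"t1")
    .otherwise($"t2"))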
Upvotes: 0
Views: 952
Reputation: 42352
A UDF is probably unnecessary here - you can use the Spark SQL function least:
var df = Seq((1L, "2021-01-04 16:10:00","2021-01-04 15:20:00")).toDF("id","t1","t2")
df = (df.withColumn("t1",to_timestamp($"t1","yyyy-MM-dd HH:mm:ss"))
.withColumn("t2",to_timestamp($"t2","yyyy-MM-dd HH:mm:ss")))
df = df.withColumn("t3",least($"t1",$"t2"))
df.show()
+---+-------------------+-------------------+-------------------+
| id| t1| t2| t3|
+---+-------------------+-------------------+-------------------+
| 1|2021-01-04 16:10:00|2021-01-04 15:20:00|2021-01-04 15:20:00|
+---+-------------------+-------------------+-------------------+
The opposite of least is greatest, if you want to get the larger of the two columns.
Note that both least and greatest will ignore null values, but they will return null if all input columns are null.
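For example, extending the DataFrame above with null rows demonstrates that behaviour (a sketch; the Option values are just a convenient way to build nullable string columns):
var df2 = Seq(
  (1L, Some("2021-01-04 16:10:00"), Some("2021-01-04 15:20:00")),
  (2L, None, Some("2021-01-04 15:20:00")),
  (3L, None, None)
).toDF("id", "t1", "t2")
df2 = (df2.withColumn("t1", to_timestamp($"t1", "yyyy-MM-dd HH:mm:ss"))
          .withColumn("t2", to_timestamp($"t2", "yyyy-MM-dd HH:mm:ss")))
// id 2: least returns 2021-01-04 15:20:00 (the null t1 is ignored)
// id 3: least returns null (all inputs are null); greatest behaves the same way
df2.withColumn("t3", least($"t1", $"t2")).show()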
Upvotes: 1
Reputation: 51271
Try this:
(Option(t1) ++ Option(t2)).minOption
It should do the same job as your if..else if..else stack.
Oops. My bad. minOption was only added in Scala 2.13, and Spark doesn't do Scala 2.13.x. Try this instead:
util.Try((Option(t1) ++ Option(t2)).minBy(_.getTime())).toOption
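Plugged back into the original UDF, that would look something like this (a sketch, keeping the same name and signature as in the question):
val compareTime = udf((t1: Timestamp, t2: Timestamp) =>
  // Option(null) is None, so ++ keeps only the non-null side(s);
  // minBy(_.getTime()) picks the earlier timestamp; Try(...).toOption
  // turns the both-null case (minBy on an empty collection throws) into None
  util.Try((Option(t1) ++ Option(t2)).minBy(_.getTime())).toOption
)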
Upvotes: 1