Georg Heiler

Reputation: 17724

Spark time difference

From Spark 1.5.x onwards, some convenience methods were added to deal with time.

df.select(datediff($"A", $"B")).show

But this returns the difference in days. Is there any option to convert it to a different interval, e.g. years?

Manually I used

df.select(year($"A")- year($"B")).show

which is good enough. But it would be better if the date difference were calculated more accurately, e.g. including months / days, as displayed at http://howtodoinjava.com/for-fun-only/java-code-to-calculate-age-from-date-of-birth/

Is a UDF the go-to solution for current Spark 2.0.2, or am I missing an option to convert the difference in days to e.g. years?

Upvotes: 1

Views: 3289

Answers (1)

zero323

Reputation: 330413

The timestamp type can be converted to its Unix timestamp in seconds. If this resolution is good enough for you, the rest is just a bunch of arithmetic operations. You can compute the difference in seconds as:

import org.apache.spark.sql.Column

def secondsBetween(col1: Column, col2: Column) =
  col2.cast("timestamp").cast("bigint") - col1.cast("timestamp").cast("bigint")

and re-sample according to your needs, for example:

def minutesBetween(col1: Column, col2: Column) =
  (secondsBetween(col1, col2) / 60).cast("bigint")
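
For illustration only, a minimal usage sketch (a spark-shell session with spark.implicits._ in scope is assumed; the column names A and B and the sample values are made up):

// Hypothetical sample data; the strings cast cleanly to timestamps inside secondsBetween
val df = Seq(
  ("2015-01-01 00:00:00", "2016-07-01 12:30:00")
).toDF("A", "B")

df.select(
  secondsBetween($"A", $"B").alias("seconds"),
  minutesBetween($"A", $"B").alias("minutes")
).show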

Obviously this doesn't handle all the subtleties of date and time processing, like daylight saving time or leap seconds, so if these are important in your domain then a UDF with a proper time processing library is recommended.
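
If you do go the UDF route, here is a rough sketch using java.time.Period (the name periodBetween and the choice of java.time are my assumptions, not part of the answer above; it ignores nulls and time-of-day):

import java.sql.Date
import java.time.Period
import org.apache.spark.sql.functions.udf

// Hypothetical UDF: year/month/day breakdown between two date columns,
// respecting varying month lengths and leap years. Does not handle nulls.
val periodBetween = udf { (start: Date, end: Date) =>
  val p = Period.between(start.toLocalDate, end.toLocalDate)
  (p.getYears, p.getMonths, p.getDays)
}

df.select(periodBetween($"A".cast("date"), $"B".cast("date"))).show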

Upvotes: 5
