belgacea

Reputation: 1164

Spark using timestamp inside an RDD

I'm trying to compare timestamps within a map, but Spark seems to be using a different time zone or something else that is really weird. I read a dummy CSV file like the following to build the input DataFrame:

"ts"
"1970-01-01 00:00:00"
"1970-01-01 00:00:00"
df.show(2)
+-------------------+
|        ts         |
+-------------------+
|1970-01-01 00:00:00|
|1970-01-01 00:00:00|
+-------------------+

For now, nothing to report, but then:

import java.sql.Timestamp
import java.time.Instant

df.rdd.map { row =>
  // Read the "ts" column back as a java.sql.Timestamp
  val timestamp = row.getTimestamp(0)
  val timestampMilli = timestamp.toInstant.toEpochMilli
  // Build a reference Timestamp for the Unix epoch (1970-01-01T00:00:00Z)
  val epoch = Timestamp.from(Instant.EPOCH)
  val epochMilli = epoch.toInstant.toEpochMilli
  (timestamp, timestampMilli, epoch, epochMilli)
}.foreach(println)
(1970-01-01 00:00:00.0,-3600000,1970-01-01 01:00:00.0,0)
(1970-01-01 00:00:00.0,-3600000,1970-01-01 01:00:00.0,0)

I don't understand why both timestamps are not 1970-01-01 00:00:00.0, 0. Does anyone know what I'm missing?

NB: I already set up the session time zone to UTC, using the following properties.

spark.sql.session.timeZone=UTC
user.timezone=UTC
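
For completeness, one place these could be set is on the SparkSession builder (a sketch; where the properties are actually applied isn't shown in the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.session.timeZone", "UTC")
  .config("user.timezone", "UTC")
  .getOrCreate()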

Upvotes: 3

Views: 1200

Answers (1)

Matt Johnson-Pint

Reputation: 241450

The java.sql.Timestamp class inherits from java.util.Date. They both have the behavior of storing a UTC-based numeric timestamp, but displaying time in the local time zone. You'd see this with .toString() in Java, the same as you're seeing with println in the code given.
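
A minimal standalone sketch of that behaviour (not part of the original answer; the explicit time-zone switching is only there to make the effect visible):

import java.sql.Timestamp
import java.time.Instant
import java.util.TimeZone

val epoch = Timestamp.from(Instant.EPOCH)     // numeric value is 0, i.e. 1970-01-01T00:00:00Z

TimeZone.setDefault(TimeZone.getTimeZone("UTC"))
println(epoch)                                // 1970-01-01 00:00:00.0

TimeZone.setDefault(TimeZone.getTimeZone("Europe/London"))
println(epoch)                                // 1970-01-01 01:00:00.0 (London was UTC+1 at the epoch)

println(epoch.getTime)                        // 0 either way; only the rendering changes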

I believe your OS (or environment) is set to something similar to Europe/London. Keep in mind that at the Unix epoch (1970-01-01T00:00:00Z), London was on BST (UTC+1).

Your timestampMilli variable is showing -3600000 because Spark has interpreted your input in local time as 1970-01-01T00:00:00+01:00, which is equivalent to 1969-12-31T23:00:00Z.
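
To illustrate that interpretation with plain java.time (a sketch, not Spark itself):

import java.time.{LocalDateTime, ZoneId}

// The wall-clock value from the CSV, read as Europe/London local time
val local   = LocalDateTime.parse("1970-01-01T00:00:00")
val instant = local.atZone(ZoneId.of("Europe/London")).toInstant
println(instant.toEpochMilli)                 // -3600000, i.e. 1969-12-31T23:00:00Z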

Your epoch variable is showing 1970-01-01 01:00:00.0 because 0 is equivalent to 1970-01-01T00:00:00Z, which is equivalent to 1970-01-01T01:00:00+01:00.


I do see you noted that you set your session time zone to UTC, which in theory should work. But the results clearly show that it isn't being used. Sorry, I don't know Spark well enough to tell you why, but I would focus on that part of the problem.

Upvotes: 4
