Raman Yelianevich

Reputation: 1137

How to convert Timestamp column to milliseconds Long column in Spark SQL

What is the shortest and the most efficient way in Spark SQL to transform Timestamp column to a milliseconds timestamp Long column?

Here is an example of transforming a timestamp into milliseconds:

scala> val ts = spark.sql("SELECT now() as ts")
ts: org.apache.spark.sql.DataFrame = [ts: timestamp]

scala> ts.show(false)
+-----------------------+                                                       
|ts                     |
+-----------------------+
|2019-06-18 12:32:02.41 |
+-----------------------+

scala> val tss = ts.selectExpr(
 |   "ts",
 |   "BIGINT(ts) as seconds_ts",
 |   "BIGINT(ts) * 1000 + BIGINT(date_format(ts, 'SSS')) as millis_ts"
 | )
tss: org.apache.spark.sql.DataFrame = [ts: timestamp, seconds_ts: bigint ... 1 more field]

scala> tss.show(false)
+----------------------+----------+-------------+                               
|ts                    |seconds_ts|millis_ts    |
+----------------------+----------+-------------+
|2019-06-18 12:32:02.41|1560861122|1560861122410|
+----------------------+----------+-------------+

As you can see, the most straightforward method to get milliseconds from a timestamp doesn't work: a cast to long returns seconds, even though the millisecond information in the timestamp is preserved.

The only way I found to extract the millisecond part is the date_format function, which is nowhere near as simple as I would expect.

Does anybody know the way to get milliseconds UNIX time out of Timestamp column simpler than that?
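For reference, the arithmetic the selectExpr above performs can be sketched on the plain JVM, without a Spark session. This is only an illustration of the same decomposition (whole seconds times 1000 plus the sub-second millisecond part); the timestamp value is a made-up example, not output from Spark:

```scala
// Plain-JVM sketch of the same arithmetic as the selectExpr above.
// BIGINT(ts) corresponds to whole seconds since the epoch, and
// date_format(ts, 'SSS') to the sub-second millisecond part.
val ts = java.sql.Timestamp.valueOf("2019-06-18 12:32:02.410")

val seconds = ts.getTime / 1000         // whole seconds, like BIGINT(ts)
val millisPart = ts.getNanos / 1000000  // sub-second millis, like date_format(ts, 'SSS')

val millisTs = seconds * 1000 + millisPart
// millisTs now equals ts.getTime, i.e. milliseconds since the Unix epoch
```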

Upvotes: 2

Views: 3287

Answers (1)

Michael Heil

Reputation: 18495

According to the documentation in Spark's DateTimeUtils:

"Timestamps are exposed externally as java.sql.Timestamp and are stored internally as longs, which are capable of storing timestamps with microsecond precision."

Therefore, if you define a UDF that takes a java.sql.Timestamp as input, you can simply call getTime to obtain the Long value in milliseconds.

import org.apache.spark.sql.functions.{udf, to_timestamp, col}
import org.apache.spark.sql.types.LongType
import spark.implicits._ // for .toDF on a local Seq

val tsConversionToLongUdf = udf((ts: java.sql.Timestamp) => ts.getTime)

Applying this to a variety of Timestamps:

val df = Seq("2017-01-18 11:00:00.000", "2017-01-18 11:00:00.111", "2017-01-18 11:00:00.110", "2017-01-18 11:00:00.100")
  .toDF("timestampString")
  .withColumn("timestamp", to_timestamp(col("timestampString")))
  .withColumn("timestampConversionToLong", tsConversionToLongUdf(col("timestamp")))
  .withColumn("timestampCastAsLong", col("timestamp").cast(LongType))

df.printSchema()
df.show(false)

// returns
root
 |-- timestampString: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampConversionToLong: long (nullable = false)
 |-- timestampCastAsLong: long (nullable = true)

+-----------------------+-----------------------+-------------------------+-------------------+
|timestampString        |timestamp              |timestampConversionToLong|timestampCastAsLong|
+-----------------------+-----------------------+-------------------------+-------------------+
|2017-01-18 11:00:00.000|2017-01-18 11:00:00    |1484733600000            |1484733600         |
|2017-01-18 11:00:00.111|2017-01-18 11:00:00.111|1484733600111            |1484733600         |
|2017-01-18 11:00:00.110|2017-01-18 11:00:00.11 |1484733600110            |1484733600         |
|2017-01-18 11:00:00.100|2017-01-18 11:00:00.1  |1484733600100            |1484733600         |
+-----------------------+-----------------------+-------------------------+-------------------+

Note that the column "timestampCastAsLong" just shows that a direct cast to a Long returns seconds, not the desired milliseconds.
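The UDF itself does no conversion work of its own; it only delegates to java.sql.Timestamp.getTime, which already returns milliseconds since the Unix epoch. A minimal plain-Scala check (no Spark needed; the timestamp string is one of the example values above):

```scala
// java.sql.Timestamp.getTime returns milliseconds since the Unix epoch,
// so the sub-second part of the timestamp survives the conversion.
val ts = java.sql.Timestamp.valueOf("2017-01-18 11:00:00.111")
val millis = ts.getTime
// millis % 1000 recovers the .111 fractional part as 111
```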

Upvotes: 1
