Reputation: 1137
What is the shortest and most efficient way in Spark SQL to transform a Timestamp column into a milliseconds timestamp Long column?
Here is an example of a transformation from timestamp to milliseconds
scala> val ts = spark.sql("SELECT now() as ts")
ts: org.apache.spark.sql.DataFrame = [ts: timestamp]
scala> ts.show(false)
+-----------------------+
|ts |
+-----------------------+
|2019-06-18 12:32:02.41 |
+-----------------------+
scala> val tss = ts.selectExpr(
| "ts",
| "BIGINT(ts) as seconds_ts",
| "BIGINT(ts) * 1000 + BIGINT(date_format(ts, 'SSS')) as millis_ts"
| )
tss: org.apache.spark.sql.DataFrame = [ts: timestamp, seconds_ts: bigint ... 1 more field]
scala> tss.show(false)
+----------------------+----------+-------------+
|ts |seconds_ts|millis_ts |
+----------------------+----------+-------------+
|2019-06-18 12:32:02.41|1560861122|1560861122410|
+----------------------+----------+-------------+
As you can see, the most straightforward method of getting milliseconds from a timestamp doesn't work: a cast to long returns seconds, even though the milliseconds information in the timestamp is preserved.
The only way I found to extract the milliseconds is by using the date_format
function, which is not nearly as simple as I would expect.
Does anybody know a simpler way to get the UNIX time in milliseconds out of a Timestamp
column?
Upvotes: 2
Views: 3287
Reputation: 18495
According to the code in Spark's DateTimeUtils:
"Timestamps are exposed externally as java.sql.Timestamp
and are stored internally as longs
, which are capable of storing timestamps with microsecond precision."
Therefore, if you define a UDF that takes a java.sql.Timestamp
as input, you can simply call getTime
on it to get a Long in milliseconds.
import org.apache.spark.sql.functions.{col, to_timestamp, udf}
import org.apache.spark.sql.types.LongType
import spark.implicits._ // for .toDF

val tsConversionToLongUdf = udf((ts: java.sql.Timestamp) => ts.getTime)
Applying this to a variety of Timestamps:
val df = Seq("2017-01-18 11:00:00.000", "2017-01-18 11:00:00.111", "2017-01-18 11:00:00.110", "2017-01-18 11:00:00.100")
.toDF("timestampString")
.withColumn("timestamp", to_timestamp(col("timestampString")))
.withColumn("timestampConversionToLong", tsConversionToLongUdf(col("timestamp")))
.withColumn("timestampCastAsLong", col("timestamp").cast(LongType))
df.printSchema()
df.show(false)
// returns
root
|-- timestampString: string (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampConversionToLong: long (nullable = false)
|-- timestampCastAsLong: long (nullable = true)
+-----------------------+-----------------------+-------------------------+-------------------+
|timestampString |timestamp |timestampConversionToLong|timestampCastAsLong|
+-----------------------+-----------------------+-------------------------+-------------------+
|2017-01-18 11:00:00.000|2017-01-18 11:00:00 |1484733600000 |1484733600 |
|2017-01-18 11:00:00.111|2017-01-18 11:00:00.111|1484733600111 |1484733600 |
|2017-01-18 11:00:00.110|2017-01-18 11:00:00.11 |1484733600110 |1484733600 |
|2017-01-18 11:00:00.100|2017-01-18 11:00:00.1 |1484733600100 |1484733600 |
+-----------------------+-----------------------+-------------------------+-------------------+
Note that the column "timestampCastAsLong" just shows that a direct cast to a Long
returns seconds, not the desired milliseconds.
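For completeness, here is a UDF-free sketch (not from the answer above; it assumes a running `spark` session and the `df` from the example are in scope). Casting the timestamp to double keeps the fractional seconds, which can then be scaled to milliseconds:

```scala
import org.apache.spark.sql.functions.{col, round}

// Casting a timestamp to double yields seconds with the fractional part kept;
// round() before the final cast guards against floating-point representation error.
val viaCast = df.withColumn(
  "millis_ts",
  round(col("timestamp").cast("double") * 1000).cast("long")
)

// On Spark 3.1+ (assumption: check your version) a built-in does this directly:
// import org.apache.spark.sql.functions.unix_millis
// val viaBuiltin = df.withColumn("millis_ts", unix_millis(col("timestamp")))
```

The double round-trip should be exact for millisecond-precision timestamps, but for microsecond precision the UDF (or `unix_millis`, where available) is the safer route.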
Upvotes: 1