Reputation: 101
I am using Spark 2.0 and looking for a way to achieve the following in Scala:
I need the timestamp difference in milliseconds between two DataFrame column values.
Value_1 = 06/13/2017 16:44:20.044
Value_2 = 06/13/2017 16:44:21.067
The data type of both columns is timestamp.
Note: applying the function unix_timestamp(Column s) to both values and subtracting works, but only to second precision, not down to the millisecond, which is the requirement.
The final query would look like this:
Select timestamp_diff(Value_2, Value_1) from table1
and should return the following output:
1023 milliseconds (16:44:21.067 − 16:44:20.044 = 1.023 s)
where timestamp_diff is the function that calculates the difference in milliseconds.
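For reference, a minimal sketch of the second-precision approach (assuming the two values sit in columns Value_1 and Value_2 of a dataframe df); unix_timestamp truncates to whole seconds, so this returns 1 rather than 1023:

import org.apache.spark.sql.functions.{col, unix_timestamp}

// unix_timestamp yields whole seconds since the epoch, so the .044 and .067
// fractions are dropped and the subtraction gives 1 second, not 1023 ms.
df.select((unix_timestamp(col("Value_2")) - unix_timestamp(col("Value_1"))).as("diff_seconds"))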
Upvotes: 9
Views: 8043
Reputation: 829
Bit late to the party, but hope it's still useful.
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// Casting a timestamp to double keeps the fractional (sub-second) part,
// so multiplying by 1000 before the cast to long preserves milliseconds.
def getUnixTimestamp(col: Column): Column = (col.cast("double") * 1000).cast("long")

df.withColumn("diff", getUnixTimestamp(col("col2")) - getUnixTimestamp(col("col1")))
Of course you can define a separate method for the difference:
def timestampDiff(col1: Column, col2: Column): Column = getUnixTimestamp(col2) - getUnixTimestamp(col1)
df.withColumn("diff", timestampDiff(col("col1"), col("col2")))
To make life easier, one can define an overloaded method for Strings with a default diff name:
def timestampDiff(col1: String, col2: String): Column = timestampDiff(col(col1), col(col2)).as("diff")
Now in action:
scala> df.show(false)
+-----------------------+-----------------------+
|min_time |max_time |
+-----------------------+-----------------------+
|1970-01-01 01:00:02.345|1970-01-01 01:00:04.786|
|1970-01-01 01:00:23.857|1970-01-01 01:00:23.999|
|1970-01-01 01:00:02.325|1970-01-01 01:01:07.688|
|1970-01-01 01:00:34.235|1970-01-01 01:00:34.444|
|1970-01-01 01:00:34.235|1970-01-01 01:00:34.454|
+-----------------------+-----------------------+
scala> df.withColumn("diff", timestampDiff("min_time", "max_time")).show(false)
+-----------------------+-----------------------+-----+
|min_time |max_time |diff |
+-----------------------+-----------------------+-----+
|1970-01-01 01:00:02.345|1970-01-01 01:00:04.786|2441 |
|1970-01-01 01:00:23.857|1970-01-01 01:00:23.999|142 |
|1970-01-01 01:00:02.325|1970-01-01 01:01:07.688|65363|
|1970-01-01 01:00:34.235|1970-01-01 01:00:34.444|209 |
|1970-01-01 01:00:34.235|1970-01-01 01:00:34.454|219 |
+-----------------------+-----------------------+-----+
scala> df.select(timestampDiff("min_time", "max_time")).show(false)
+-----+
|diff |
+-----+
|2441 |
|142 |
|65363|
|209 |
|219 |
+-----+
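If you want the SQL shape from the question without registering a UDF, the same cast trick can be written directly in Spark SQL; a sketch, assuming a view named table1 over the dataframe above:

df.createOrReplaceTempView("table1")
spark.sql(
  """SELECT *,
    |       CAST(CAST(max_time AS DOUBLE) * 1000 AS BIGINT) -
    |       CAST(CAST(min_time AS DOUBLE) * 1000 AS BIGINT) AS diff
    |FROM table1""".stripMargin).show(false)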
Upvotes: 1
Reputation: 2110
The same for PySpark:
import datetime

def timestamp_diff(time1: datetime.datetime, time2: datetime.datetime):
    return int((time1 - time2).total_seconds() * 1000)

The int and * 1000 are only there to output milliseconds.
Example usage:
spark.udf.register("timestamp_diff", timestamp_diff)
df.registerTempTable("table1")
df2 = spark.sql("SELECT *, timestamp_diff(col2, col1) as diff from table1")
It's not an optimal solution since UDFs are usually slow, so you might run into performance issues.
Upvotes: 2
Reputation: 28332
One way would be to use Unix epoch time, the number of milliseconds since 1 January 1970. Below is an example using a UDF; it takes two timestamps and returns the difference between them in milliseconds.
import java.sql.Timestamp
import org.apache.spark.sql.functions.{col, udf}

// Timestamp.getTime() returns milliseconds since the epoch.
val timestamp_diff = udf((startTime: Timestamp, endTime: Timestamp) => {
  startTime.getTime() - endTime.getTime()
})

val df = ... // dataframe with two timestamp columns (col1 and col2)
val result = df.withColumn("diff", timestamp_diff(col("col2"), col("col1")))
Alternatively, you can register the function to use with SQL commands:
// A plain Scala function this time; register infers the UDF's types.
val timestamp_diff = (startTime: Timestamp, endTime: Timestamp) => {
  startTime.getTime() - endTime.getTime()
}

spark.udf.register("timestamp_diff", timestamp_diff)
df.createOrReplaceTempView("table1")
val df2 = spark.sql("SELECT *, timestamp_diff(col2, col1) AS diff FROM table1")
Upvotes: 6