Reputation: 595
My requirement is to filter a dataframe on its timestamp column so that only rows less than 10 minutes old are kept. The dataframe looks like:
ID,timestamp,value
ID-1,8/23/2017 6:11:13,4.56
ID-2,8/23/2017 6:5:21,5.92
ID-3,8/23/2017 5:49:13,6.00
I am trying the following code, but it does not give the desired output:
df.filter(to_date(df("timestamp")).gt(lit(now.get(Calendar.MINUTE)-10)))
Any help will be appreciated.
Upvotes: 2
Views: 7239
Reputation: 41957
Given the dataframe as
+----+------------------+-----+
|ID |timestamp |value|
+----+------------------+-----+
|ID-1|8/23/2017 14:48:13|4.56 |
|ID-2|8/23/2017 6:5:21 |5.92 |
|ID-3|8/23/2017 5:49:13 |6.0 |
+----+------------------+-----+
and current time as
2017-08-23 14:53:33
You can do the following
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.TimestampType

df.withColumn("timestamp", unix_timestamp($"timestamp", "MM/dd/yyyy HH:mm:ss"))
  .filter((unix_timestamp(current_timestamp()) - $"timestamp") / 60 < 10)
  .select($"ID", $"timestamp".cast(TimestampType), $"value")
Here unix_timestamp converts the timestamp string to a long (seconds since the epoch), and current_timestamp() gives the current date and time, so the difference divided by 60 is the age of each row in minutes.
The final dataframe should be
+----+---------------------+-----+
|ID |timestamp |value|
+----+---------------------+-----+
|ID-1|2017-08-23 14:48:13.0|4.56 |
+----+---------------------+-----+
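The same arithmetic can be checked outside Spark. Below is a minimal plain-Scala sketch (using java.time instead of Spark's SimpleDateFormat-based unix_timestamp, and a fixed "current time" taken from the example above so the result is reproducible); the sample data and the 10-minute cutoff are the ones from this answer:

```scala
import java.time.format.DateTimeFormatter
import java.time.{LocalDateTime, ZoneOffset}

// Single-letter pattern fields so java.time accepts "6:5:21";
// Spark's unix_timestamp is lenient about this with "HH:mm:ss".
val fmt = DateTimeFormatter.ofPattern("M/d/yyyy H:m:s")

def epochSeconds(ts: String): Long =
  LocalDateTime.parse(ts, fmt).toEpochSecond(ZoneOffset.UTC)

// "Current time" fixed to the value used in the answer.
val now = epochSeconds("8/23/2017 14:53:33")

val rows = Seq(
  ("ID-1", "8/23/2017 14:48:13", 4.56),
  ("ID-2", "8/23/2017 6:5:21",   5.92),
  ("ID-3", "8/23/2017 5:49:13",  6.00)
)

// Keep rows whose timestamp is less than 10 minutes old.
val recent = rows.filter { case (_, ts, _) =>
  (now - epochSeconds(ts)) / 60.0 < 10
}
```

Only ID-1 (about 5 minutes old) survives the filter, matching the Spark output above.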
I hope the answer is helpful.
Upvotes: 4