Reputation: 8395
I have a Spark dataframe with a column containing dates in the format dd-MMM-yyyy HH:mm:ss.
How do I do a time-range query such as:
Find all the rows between two dates that also fall within a specific time range, say 4 PM to 1 AM.
This is possible in SQL using DATEPART (see Specific Time Range Query in SQL Server).
How do I do the same with a Spark dataframe?
For example,
I want to find all the rows from 23-MAR-2016 to 25-MAR-2016, but only within the time range 13:00:00 to 18:00:00.
So I should get only one row as the result.
var input = spark.createDataFrame(Seq(
(13L, "Abhi c", "22-MAR-2016 09:10:12"),
(11L, "VF", "23-MAR-2016 16:24:25"),
(12L, "Alice Jones", "24-MAR-2016 19:20:25")
)).toDF("id", "name", "time")
input.filter("time between '23-MAR-2016' and '25-MAR-2016'").show()
+---+-----------+--------------------+
| id| name| time|
+---+-----------+--------------------+
| 11| VF|23-MAR-2016 16:24:25|
| 12|Alice Jones|24-MAR-2016 19:20:25|
+---+-----------+--------------------+
My query above only filters on the dates (and I could include a time in the bounds as well), but how do I get only the rows that fall within a given time range on each day?
Upvotes: 1
Views: 10155
Reputation: 788
There is a function to retrieve the hour of a timestamp. Here is how to select data that lies between 10am and 1pm in PySpark:
from pyspark.sql.functions import hour
data.select("ts").where((hour("ts") > 10) & (hour("ts") < 13))
You can further filter with, e.g., pyspark.sql.functions.month, pyspark.sql.functions.year, or pyspark.sql.functions.dayofmonth.
Upvotes: 1
Reputation: 6739
You can use the following if you don't want to add any intermediate columns to your dataframe.
import org.apache.spark.sql.functions._
import spark.implicits._  // needed for the $"time" syntax when not in spark-shell
var input = spark.createDataFrame(Seq(
  (13L, "Abhi c", "22-MAR-2016 09:10:12"),
  (11L, "VF", "23-MAR-2016 16:24:25"),
  (12L, "Alice Jones", "24-MAR-2016 19:20:25")
)).toDF("id", "name", "time")
val ts = unix_timestamp($"time", "dd-MMM-yyyy HH:mm:ss").cast("timestamp")
input.filter("time between '23-MAR-2016' and '25-MAR-2016'").filter(hour(ts) >= 13 && hour(ts) <= 18).show
Output:
+---+----+--------------------+
| id|name| time|
+---+----+--------------------+
| 11| VF|23-MAR-2016 16:24:25|
+---+----+--------------------+
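The question also mentions a window that wraps past midnight (4 PM to 1 AM). A single hour range can't express that, so the check has to be split into two pieces joined with ||. A minimal sketch, reusing the input dataframe and the ts column defined above:
input.filter("time between '23-MAR-2016' and '25-MAR-2016'")
  .filter(hour(ts) >= 16 || hour(ts) < 1)  // 16:00-23:59 or 00:00-00:59
  .show()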
Upvotes: 0
Reputation: 564
You can do something like this:
import org.apache.spark.sql.functions.{hour, unix_timestamp}
import spark.implicits._  // for the $"time" syntax
var input = spark.createDataFrame(Seq(
(13L, "Abhi c", "22-MAR-2016 09:10:12"),
(11L, "VF", "23-MAR-2016 16:24:25"),
(12L, "Alice Jones", "24-MAR-2016 19:20:25")
)).toDF("id", "name", "time")
val h = hour(unix_timestamp($"time", "dd-MMM-yyyy HH:mm:ss").cast("timestamp"))
input.withColumn("hour", h).filter("time BETWEEN '23-MAR-2016' AND '25-MAR-2016' AND hour BETWEEN 13 AND 18").show()
+---+----+--------------------+----+
| id|name| time|hour|
+---+----+--------------------+----+
| 11| VF|23-MAR-2016 16:24:25| 16|
+---+----+--------------------+----+
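One caveat: time BETWEEN '23-MAR-2016' AND '25-MAR-2016' compares the raw strings, which happens to return the right rows here only because every row falls in the same month (the string sort effectively orders by day of month). A minimal sketch of filtering on the parsed timestamp instead, assuming the same input dataframe (the ts column name is just for illustration):
import org.apache.spark.sql.functions.{hour, lit, unix_timestamp}

val withTs = input.withColumn("ts", unix_timestamp($"time", "dd-MMM-yyyy HH:mm:ss").cast("timestamp"))
withTs.filter($"ts" >= lit("2016-03-23 00:00:00").cast("timestamp") &&
              $"ts" <  lit("2016-03-26 00:00:00").cast("timestamp"))  // upper bound exclusive, so all of 25-MAR is kept
      .filter(hour($"ts").between(13, 18))
      .show()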
Upvotes: 3