Abhishek Choudhary

Reputation: 8395

Spark DataFrame query between 2 specific Time Range

I have a Spark DataFrame with a column holding dates in the format dd-MMM-yyyy hh:mm.

How do I run a time-range query such as:

Find all the rows between two dates that also fall within a specific time range, say 4 PM to 1 AM.

This is possible in SQL using DATEPART, as described in DatePart Specific Time Range Query in SQL Server.

How can I do the same with a Spark DataFrame?

For example,

I want to find all the rows from 23-MAR-2016 to 25-MAR-2016, but only within the time range from 13:00:00 to 18:00:00.

So I should get only one row as the result.

var input = spark.createDataFrame(Seq(
        (13L, "Abhi c", "22-MAR-2016 09:10:12"),
        (11L, "VF", "23-MAR-2016 16:24:25"),
        (12L, "Alice Jones", "24-MAR-2016 19:20:25")
        )).toDF("id", "name", "time")

input.filter("time between '23-MAR-2016' and '25-MAR-2016'").show()

+---+-----------+--------------------+
| id|       name|                time|
+---+-----------+--------------------+
| 11|         VF|23-MAR-2016 16:24:25|
| 12|Alice Jones|24-MAR-2016 19:20:25|
+---+-----------+--------------------+

My query above only filters on the date; I can also supply times, but how do I restrict the result to a time range within each day?

Upvotes: 1

Views: 10155

Answers (3)

00schneider

Reputation: 788

There is a function to retrieve the hour of a timestamp. Here is how to select, in PySpark, the rows whose timestamp lies between 10 AM and 1 PM:

from pyspark.sql.functions import hour

data.select("ts").where((hour("ts") > 10) & (hour("ts") < 13))

You can further filter with, e.g., pyspark.sql.functions.month, pyspark.sql.functions.year, or pyspark.sql.functions.dayofmonth.
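
Combining these, here is a small sketch of the question's full filter (a date window plus an hour-of-day window), assuming the same data DataFrame with a proper timestamp column ts as above:

from pyspark.sql.functions import col, hour, to_date

# keep rows dated 23-25 March 2016 whose hour of day lies between 13 and 18 (inclusive)
result = data.where(
    (to_date(col("ts")).between("2016-03-23", "2016-03-25"))
    & (hour(col("ts")).between(13, 18))
)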

Upvotes: 1

Prasad Khode

Reputation: 6739

You can use the following if you don't want to add any intermediate columns to your DataFrame.

import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"time" column syntax outside spark-shell

var input = spark.createDataFrame(Seq(
    (13L, "Abhi c", "22-MAR-2016 09:10:12"),
    (11L, "VF", "23-MAR-2016 16:24:25"),
    (12L, "Alice Jones", "24-MAR-2016 19:20:25")
    )).toDF("id", "name", "time")

val ts = unix_timestamp($"time", "dd-MMM-yyyy HH:mm:ss").cast("timestamp")

input.filter("time between '23-MAR-2016' and '25-MAR-2016'")
  .filter(hour(ts) >= 13 && hour(ts) <= 18)
  .show()

Output:

+---+----+--------------------+
| id|name|                time|
+---+----+--------------------+
| 11|  VF|23-MAR-2016 16:24:25|
+---+----+--------------------+

Upvotes: 0

Alex Karpov

Reputation: 564

You can do something like this:

import org.apache.spark.sql.functions.{hour, unix_timestamp}
import spark.implicits._ // for the $"time" column syntax outside spark-shell

var input = spark.createDataFrame(Seq(
    (13L, "Abhi c", "22-MAR-2016 09:10:12"),
    (11L, "VF", "23-MAR-2016 16:24:25"),
    (12L, "Alice Jones", "24-MAR-2016 19:20:25")
    )).toDF("id", "name", "time")

val h = hour(unix_timestamp($"time", "dd-MMM-yyyy HH:mm:ss").cast("timestamp"))

input.withColumn("hour", h).filter("time BETWEEN '23-MAR-2016' AND '25-MAR-2016' AND hour BETWEEN 13 AND 18").show()

+---+----+--------------------+----+
| id|name|                time|hour|
+---+----+--------------------+----+
| 11|  VF|23-MAR-2016 16:24:25|  16|
+---+----+--------------------+----+
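
If you don't need the extra hour column in the final result, you can drop it after filtering, e.g. by appending .drop("hour") to the chain above. Note that for a window that crosses midnight, like the 4 PM to 1 AM range mentioned in the question, the hour condition becomes a disjunction instead, e.g. hour >= 16 OR hour < 1.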

Upvotes: 3
