Eleanor

Reputation: 2847

pyspark check if HH:mm:ss is in a range

I have some data that looks like this:

time
08:28:24
22:20:54 
12:59:38
21:46:07

I want to select the times that fall between 16:00:00 and 23:59:59, inclusive (a closed range).

What should I do with it? (The 'time' column type is string.)

Thank you!

Upvotes: 0

Views: 2995

Answers (2)

pault

Reputation: 43544

Your condition can be simplified to checking if the hour part of your time column is between 16 and 23.

You can get the hour by using pyspark.sql.functions.split to tokenize the time column on the : character. Extract the token at index 0 to get the hour, and make the comparison using pyspark.sql.Column.between() (which is inclusive of the bounds).

from pyspark.sql.functions import split
df.where(split("time", ":")[0].between(16, 23)).show()
#+--------+
#|    time|
#+--------+
#|22:20:54|
#|21:46:07|
#+--------+

Note that even though split returns a string, there is an implicit conversion to int to do the between comparison.
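
If you prefer to make that conversion explicit instead of relying on the implicit cast, an equivalent variant (same result, just with the cast spelled out) is:

from pyspark.sql.functions import split

# cast the hour token to int before the (inclusive) between comparison
df.where(split("time", ":")[0].cast("int").between(16, 23)).show()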


Of course, this could be extended if you had more complicated filtering criteria that also involved looking at minutes or seconds:

df.select(
    "*",
    split("time", ":")[0].cast("int").alias("hour"),
    split("time", ":")[1].cast("int").alias("minute"),
    split("time", ":")[2].cast("int").alias("second")
).show()
#+--------+----+------+------+
#|    time|hour|minute|second|
#+--------+----+------+------+
#|08:28:24|   8|    28|    24|
#|22:20:54|  22|    20|    54|
#|12:59:38|  12|    59|    38|
#|21:46:07|  21|    46|     7|
#+--------+----+------+------+
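
For example, here is a sketch of a filter that also looks at the minutes, using a hypothetical 16:30:00 lower bound (not from the question) to show why the extra tokens help; with the sample data this keeps 22:20:54 and 21:46:07:

from pyspark.sql.functions import split

hour = split("time", ":")[0].cast("int")
minute = split("time", ":")[1].cast("int")

# keep rows strictly after hour 16, or within hour 16 at/after minute 30;
# the 23:59:59 upper bound is implied, since a valid hour never exceeds 23
df.where((hour > 16) | ((hour == 16) & (minute >= 30))).show()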

Upvotes: 2

cronoik

Reputation: 19520

You can use the column function between once you have converted your column to a timestamp. See the commented example below:

import datetime
from pyspark.sql import functions as F
from pyspark.sql import types as T

#PySpark does not provide a type that holds time-only values,
#therefore we have to work with the datetime format.
#When your string represents only the time, PySpark will automatically add the current date.
#The filter expression we use later for your range also needs the current date.
now = datetime.datetime.now().strftime("%Y-%m-%d")

l1 = [('08:28:24',)
    ,('22:20:54',)
    ,('12:59:38',)
    ,('21:46:07',)
    ,('16:00:00',)]

df = spark.createDataFrame(l1,['time'])
#Converting
df = df.withColumn('time', df.time.cast(T.TimestampType()))
#Applying your filter with the current date
df = df.filter(F.col("time").between(now + ' 16:00:00',now + ' 23:59:59'))

df.show()

Output:

+-------------------+
|               time|
+-------------------+
|2019-03-12 22:20:54|
|2019-03-12 21:46:07|
|2019-03-12 16:00:00|
+-------------------+
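
If you need the time-only strings back after filtering, one option (a sketch, not part of the original answer) is to reformat the timestamp with date_format:

# format the timestamp column back to an HH:mm:ss string
df = df.withColumn('time', F.date_format('time', 'HH:mm:ss'))
df.show()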

Upvotes: 0
