Reputation: 285
I have a dataframe in pyspark that has a timestamp string column in the following format:
"11/21/2018 07:21:49 PM"
I want to filter the rows in the dataframe based on only the time portion of this string timestamp, regardless of the date. For example, I want to keep all rows that fall between the hours of 2:00 PM and 4:00 PM, inclusive.
I tried the code below to extract the HH:mm:ss portion and use the between function, but it is not working.
# Grabbing only the time portion from the datetime column
import pyspark.sql.functions as F

time_format = "HH:mm:ss"
# Split "11/21/2018 07:21:49 PM" on spaces and rejoin the time and AM/PM parts
split_col = F.split(df['datetime'], ' ')
df = df.withColumn('Time', F.concat(split_col.getItem(1), F.lit(' '), split_col.getItem(2)))
df = df.withColumn('Timestamp', F.from_unixtime(F.unix_timestamp('Time', format=time_format)))
df.filter(F.col("Timestamp").between('14:00:00', '16:00:00')).show()
Any ideas on how to filter rows based only on the HH:mm:ss portion of a timestamp column, regardless of the actual date, would be very much appreciated.
Upvotes: 1
Views: 1405
Reputation: 31540
Format your timestamp to HH:mm:ss, then filter using the between clause.
Example:
from pyspark.sql.functions import col, from_unixtime, unix_timestamp

df = spark.createDataFrame(
    [("11/21/2018 07:21:49 PM",), ("11/22/2018 04:21:49 PM",), ("11/23/2018 12:21:49 PM",)],
    ["ts"],
)

# Parse the 12-hour timestamp, reformat it as a 24-hour HH:mm:ss string,
# then filter on that string (zero-padded times compare correctly as strings).
df.withColumn("tt", from_unixtime(unix_timestamp(col("ts"), "MM/dd/yyyy hh:mm:ss a"), "HH:mm:ss")).\
    filter(col("tt").between("12:00", "16:00")).\
    show(truncate=False)
#+----------------------+--------+
#|ts                    |tt      |
#+----------------------+--------+
#|11/23/2018 12:21:49 PM|12:21:49|
#+----------------------+--------+
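For the 2:00 PM to 4:00 PM inclusive window from the question, the same pattern works with the full HH:mm:ss bounds; a minimal sketch, assuming the same ts column and input format as above:
from pyspark.sql.functions import col, from_unixtime, unix_timestamp

# Zero-padded HH:mm:ss strings sort lexicographically in time order,
# so between() gives an inclusive [14:00:00, 16:00:00] window.
df.withColumn("tt", from_unixtime(unix_timestamp(col("ts"), "MM/dd/yyyy hh:mm:ss a"), "HH:mm:ss")).\
    filter(col("tt").between("14:00:00", "16:00:00")).\
    show(truncate=False)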
Upvotes: 1