Reputation: 285
I have a dataframe in pyspark that has a timestamp string column in the following format:
"11/21/2018 07:21:49 PM"
I want to filter the rows in the dataframe based on only the time portion of this string timestamp, regardless of the date. For example, I want to keep all rows that fall between the hours of 2:00 PM and 4:00 PM, inclusive.
I tried the code below to extract the HH:mm:ss portion and use the between function, but it is not working.
# Grabbing only the time portion from the datetime column
import pyspark.sql.functions as F

time_format = "HH:mm:ss"
# Split "11/21/2018 07:21:49 PM" on spaces and rejoin the time and AM/PM parts
split_col = F.split(df['datetime'], ' ')
df = df.withColumn('Time', F.concat(split_col.getItem(1), F.lit(' '), split_col.getItem(2)))
df = df.withColumn('Timestamp', F.from_unixtime(F.unix_timestamp('Time', format=time_format)))
df.filter(F.col("Timestamp").between('14:00:00', '16:00:00')).show()
Any ideas on how to filter rows based only on the HH:mm:ss portion of a timestamp column, regardless of the actual date, would be very much appreciated.
Upvotes: 1
Views: 1405
Reputation: 31540
Format your timestamp to HH:mm:ss, then filter using the between clause.
Example:
from pyspark.sql.functions import col, from_unixtime, unix_timestamp

df = spark.createDataFrame(
    [("11/21/2018 07:21:49 PM",), ("11/22/2018 04:21:49 PM",), ("11/23/2018 12:21:49 PM",)],
    ["ts"],
)

# Parse the 12-hour timestamp, reformat it as a 24-hour HH:mm:ss string,
# then filter on that string (zero-padded times compare correctly as strings).
df.withColumn("tt", from_unixtime(unix_timestamp(col("ts"), "MM/dd/yyyy hh:mm:ss a"), "HH:mm:ss")).\
    filter(col("tt").between("12:00", "16:00")).\
    show(truncate=False)
#+----------------------+--------+
#|ts                    |tt      |
#+----------------------+--------+
#|11/23/2018 12:21:49 PM|12:21:49|
#+----------------------+--------+
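For the 2:00 PM to 4:00 PM inclusive window from the question, the same pattern works with the full HH:mm:ss bounds; a minimal sketch, assuming the same ts column and input format as above:
from pyspark.sql.functions import col, from_unixtime, unix_timestamp

# Zero-padded HH:mm:ss strings sort lexicographically in time order,
# so between() gives an inclusive [14:00:00, 16:00:00] window.
df.withColumn("tt", from_unixtime(unix_timestamp(col("ts"), "MM/dd/yyyy hh:mm:ss a"), "HH:mm:ss")).\
    filter(col("tt").between("14:00:00", "16:00:00")).\
    show(truncate=False)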
Upvotes: 1