Harshit
Harshit

Reputation: 560

Filter spark Dataframe with specific timestamp literal

I have to filter records in dataframe with all records greater than a specific timestamp.

My timestamp column in dataframe is in string format.

Name DOJ
---------
Ram  01-Jan-2000 00.00.00.001 AM
Tom  01-Jan-2020 00.00.00.001 AM
Ali  01-Jan-2010 00.00.00.001 AM

df.PrintSchema
Name|--String
DOJ |--String

I need to filter all records which are greater than ("01-Jan-2005 00.00.00.001 AM")

Everything is in String Format

I tried various ways but results are not consistent it seems it comparing string. Probably I am facing issues in converting literal to timestamp and getting variable results.

Upvotes: 0

Views: 2142

Answers (1)

Raghu
Raghu

Reputation: 1712

you cannot compare time when it is in string. Pyspark offers a api to convert strings to timestamp : https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html?highlight=to_timestamp#pyspark.sql.functions.to_timestamp With this you will be able to convert your column as a dateType and then can do comparisions.

Try it out.

In your case, you are using the 12 hour format (hh), here the allowed values are 1-12 as per SimpleDateFormat that spark uses. Refer https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html

So if you need a 24 hour format change the hour string to HH

Since you are giving 00 to the hh, it is becoming invalid. The below code works . Also careful about the format you give and the string format. if there is a mismatch then you get null.

import pyspark.sql.functions as F

tst=sqlContext.createDataFrame([('Ram','01-Jan-2000 01.00.00.001 AM'),('Tom','01-Jan-2020 11.00.00.001 AM'),('Ali',' 01-Jan-2010 07.00.00.001 AM')],schema=["Name","DOJ"])
tst1=tst.withColumn("ts",F.to_timestamp(F.col("DOJ"),format="dd-MMM-yyyy hh.mm.ss.SSS aa"))
#%%
tst2 =tst1.filter(F.col('ts')>F.to_timestamp(F.lit("01-Jan-2005 01.00.00.001 AM"),format="dd-MMM-yyyy hh.mm.ss.SSS aa"))

results :

|Name|DOJ                         |ts                 |
+----+----------------------------+-------------------+
|Tom |01-Jan-2020 11.00.00.001 AM |2020-01-01 11:00:00|
|Ali | 01-Jan-2010 07.00.00.001 AM|2010-01-01 07:00:00|
+----+----------------------------+-------------------+

Upvotes: 1

Related Questions