wjie08

Reputation: 445

How to use SparkSQL to select rows in Spark DF based on multiple conditions

I am relatively new to pyspark and I have a Spark dataframe with a date column "Issue_Date". The "Issue_Date" column contains dates ranging from 1970 to 2060 (due to errors). From the Spark dataframe I have created a temp view and have been able to filter the data from year 2018. I would also like to include the data from year 2019 (i.e., multiple conditions). Is there a way to do so? I've tried many combinations but couldn't get it to work. Any form of help is appreciated, thank you.

# Filter data from 2018
sparkdf3.createOrReplaceTempView("table_view")
sparkdf4 = spark.sql("select * from table_view where year(to_date(cast(unix_timestamp(Issue_Date,'MM/dd/yyyy') as timestamp))) = 2018")

sparkdf4.count()

Upvotes: 0

Views: 250

Answers (1)

pissall

Reputation: 7399

Did you try using year(Issue_Date) >= 2018?:

sparkdf4 = spark.sql("select * from table_view where year(to_date(cast(unix_timestamp(Issue_Date,'MM/dd/yyyy') as timestamp))) >= 2018")

If your column has errors and you want to restrict the result to specific years, you can use year IN (2018, 2019):

sparkdf4 = spark.sql("select * from table_view where year(to_date(cast(unix_timestamp(Issue_Date,'MM/dd/yyyy') as timestamp))) in (2018, 2019)")
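For reference, the same filter can also be written with the DataFrame API instead of a temp view. This is a minimal sketch, assuming sparkdf3 is the original dataframe and Issue_Date is a string in MM/dd/yyyy format:

from pyspark.sql import functions as F

# Extract the year from the parsed Issue_Date column
issue_year = F.year(F.to_date(F.col("Issue_Date"), "MM/dd/yyyy"))

# Keep only rows from 2018 or 2019
sparkdf4 = sparkdf3.filter(issue_year.isin(2018, 2019))
sparkdf4.count()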

Upvotes: 1
