Sidd
Sidd

Reputation: 291

Pyspark Dataframe Filter Condition with boolean and String in & Clause

I have a very basic Dataframe filter requirement where i need to put 2 filter conditions. Here is my dataframe-

sqlContext.createDataFrame([('1000000','ABCD','20190516'),('1000000','ABCD','20190516'),('1000000','ABCD','20120516'),('5000000','WXYZ','20110516'),('2000000','PQRS','20140516'),('3000000','PQRS','20160516'),('3000000','PQRS','20150516')],['hashVal','prodCode','date'])
>>> df.show()
+-------+--------+--------+
|hashVal|prodCode|    date|
+-------+--------+--------+
|1000000|    ABCD|20190516|
|1000000|    ABCD|20190516|
|1000000|    ABCD|20120516|
|5000000|    WXYZ|20110516|
|2000000|    PQRS|20140516|
|3000000|    PQRS|20160516|
|3000000|    PQRS|20150516|
+-------+--------+--------+

The requirement is-

  1. I need all the distinct records from this Dataframe if prodCode is in ("ABCD","WXYZ")
  2. If prodCode is not in any of these 2 values ("ABCD","WXYZ"), then i only need distinct records which are >"20140516"

So desired out is -

+-------+--------+--------+
|hashVal|prodCode|    date|
+-------+--------+--------+
|1000000|    ABCD|20120516|
|5000000|    WXYZ|20110516|
|1000000|    ABCD|20190516|
|3000000|    PQRS|20160516|
|3000000|    PQRS|20150516|
+-------+--------+--------+

If i try with below code-

df.where(( col("prodCode").isin (["ABCD","WXYZ"])) | (  ~( col("prodCode").isin(["ABCD","WXYZ"])) & col("date")>"20140516"  ) ).distinct()

It gives me the error-

u"cannot resolve '(NOT prodCode IN (ABCD,WXYZ) && date)' due to data type mismatch: differing types in '(NOT prodCode IN (ABCD,WXYZ) && date)' (boolean and string).;"

I do get the correct solution if i just break it 2 two conditions, however with this solution, i have to perform a Union which i am looking to avoid.

dfA = df.where( ( col("prodCode").isin (["ABCD","WXYZ"])) ).distinct()
dfB = df.where(~( col("prodCode").isin (["ABCD","WXYZ"])) ).where(col("date")>"20140516").distinct()
dfF = dfA.unionAll(dfB)

How can i make it work without having to split My dataframe with 2 different conditions and then merge again?

Upvotes: 0

Views: 2323

Answers (1)

Vamsi Prabhala
Vamsi Prabhala

Reputation: 49260

You are missing a parentheses around the condition. If there is more than one condition, each of them should be parenthesized. Also, one of the conditions here includes an AND within it. Make sure the individual conditions are separated by parentheses as well, to get the desired results. A OR B AND C gives a different result compared A OR (B AND C).

Having said that, your condition should be

df.where((col("prodCode").isin (["ABCD","WXYZ"])) | ((~(col("prodCode").isin(["ABCD","WXYZ"]))) & (col("date")>"20140516"))).distinct()

Upvotes: 2

Related Questions