Reputation: 291
I have a very basic Dataframe filter requirement where i need to put 2 filter conditions. Here is my dataframe-
sqlContext.createDataFrame([('1000000','ABCD','20190516'),('1000000','ABCD','20190516'),('1000000','ABCD','20120516'),('5000000','WXYZ','20110516'),('2000000','PQRS','20140516'),('3000000','PQRS','20160516'),('3000000','PQRS','20150516')],['hashVal','prodCode','date'])
>>> df.show()
+-------+--------+--------+
|hashVal|prodCode| date|
+-------+--------+--------+
|1000000| ABCD|20190516|
|1000000| ABCD|20190516|
|1000000| ABCD|20120516|
|5000000| WXYZ|20110516|
|2000000| PQRS|20140516|
|3000000| PQRS|20160516|
|3000000| PQRS|20150516|
+-------+--------+--------+
The requirement is-
So desired out is -
+-------+--------+--------+
|hashVal|prodCode| date|
+-------+--------+--------+
|1000000| ABCD|20120516|
|5000000| WXYZ|20110516|
|1000000| ABCD|20190516|
|3000000| PQRS|20160516|
|3000000| PQRS|20150516|
+-------+--------+--------+
If i try with below code-
df.where(( col("prodCode").isin (["ABCD","WXYZ"])) | ( ~( col("prodCode").isin(["ABCD","WXYZ"])) & col("date")>"20140516" ) ).distinct()
It gives me the error-
u"cannot resolve '(NOT prodCode IN (ABCD,WXYZ) && date)' due to data type mismatch: differing types in '(NOT prodCode IN (ABCD,WXYZ) && date)' (boolean and string).;"
I do get the correct solution if i just break it 2 two conditions, however with this solution, i have to perform a Union which i am looking to avoid.
dfA = df.where( ( col("prodCode").isin (["ABCD","WXYZ"])) ).distinct()
dfB = df.where(~( col("prodCode").isin (["ABCD","WXYZ"])) ).where(col("date")>"20140516").distinct()
dfF = dfA.unionAll(dfB)
How can i make it work without having to split My dataframe with 2 different conditions and then merge again?
Upvotes: 0
Views: 2323
Reputation: 49260
You are missing a parentheses around the condition. If there is more than one condition, each of them should be parenthesized. Also, one of the conditions here includes an AND
within it. Make sure the individual conditions are separated by parentheses as well, to get the desired results. A OR B AND C
gives a different result compared A OR (B AND C)
.
Having said that, your condition should be
df.where((col("prodCode").isin (["ABCD","WXYZ"])) | ((~(col("prodCode").isin(["ABCD","WXYZ"]))) & (col("date")>"20140516"))).distinct()
Upvotes: 2