Reputation: 588
I have a table like the following:
+---+----+----+
| id|Sell| Buy|
+---+----+----+
| A|null|null|
| B| Y| Y|
| C|null| Y|
| D| Y|null|
| E|null|null|
+---+----+----+
I can easilty filter single or both columns when value equals "Y". For example, the following filters when both columns are "Y":
df.filter((df["Buy"] == "Y") & (df["Sell"] == "Y"))
However, how can I filter for when a single column or both columns do not equal "Y"?? What is the code for each case? I have tried these codes, and they return no rows:
df.filter((df["Buy"] != "Y") & (df["Sell"] != "Y"))
df.filter(~((df["Buy"] == "Y") | (df["Sell"] == "Y")))
It seems it does not capture null values
Upvotes: 2
Views: 9554
Reputation: 1006
You can always try with spark SQL by creating a temporary view and write queries naturally in SQL. Such as for this we can write
df.createOrReplaceTempView('filter_value_not_equal_to_Y')
filterNotEqual=spark.sql("Select * from filter_value_not_equal_to_Y where Sell <>'Y' or Buy <>'Y'")
display(filterNotEqual)
Upvotes: 0
Reputation: 5582
The easy and simple fix would be to use null safe operator
df.filter(~((df["Buy"].eqNullSafe("Y")) | (df["Sell"].eqNullSafe("Y"))))
Upvotes: 2
Reputation: 13591
Fill the null value first and do the filter.
df.na.fill('N') \
.filter("Sell != 'Y' or Buy != 'Y'") \
.show(10, False)
df.na.fill('N') \
.filter("Sell != 'Y' and Buy != 'Y'") \
.show(10, False)
+---+----+---+
|id |Sell|Buy|
+---+----+---+
|A |N |N |
|C |N |Y |
|D |Y |N |
|E |N |N |
+---+----+---+
+---+----+---+
|id |Sell|Buy|
+---+----+---+
|A |N |N |
|E |N |N |
+---+----+---+
Upvotes: 1
Reputation: 7926
I think the problem is that they are null, null values are somehow special
Try this to filter for values where Buy
is not Y
df.filter((df["Buy"] != "Y") | (df["Buy"].isNull()))
So if you want to filter where Buy and Sell are not 'Y' as it seems by what you've tried, you need to do this:
df.filter((df["Buy"] != "Y") | (df["Buy"].isNull()) & (df["Sell"] != "Y") | (df["Sell"].isNull()))
Quick example:
Input
+---+----+----+
| id|Sell| Buy|
+---+----+----+
| A|null|null|
| B| Y| Y|
| C| Y|null|
| D| Y|null|
| E|null|null|
+---+----+----+
Output
>>> df.filter((df["Buy"] != "Y") | (df["Buy"].isNull())).show(10)
+---+----+----+
| id|Sell| Buy|
+---+----+----+
| A|null|null|
| C| Y|null|
| D| Y|null|
| E|null|null|
+---+----+----+
>>> df.filter((df["Buy"] != "Y") | (df["Buy"].isNull()) & (df["Sell"] != "Y") | (df["Sell"].isNull())).show(10)
+---+----+----+
| id|Sell| Buy|
+---+----+----+
| A|null|null|
| E|null|null|
+---+----+----+
Upvotes: 2