mah65
mah65

Reputation: 588

Filter values not equal in pyspark

I have a table like the following:

+---+----+----+
| id|Sell| Buy|
+---+----+----+
|  A|null|null|
|  B|   Y|   Y|
|  C|null|   Y|
|  D|   Y|null|
|  E|null|null|
+---+----+----+

I can easilty filter single or both columns when value equals "Y". For example, the following filters when both columns are "Y":

df.filter((df["Buy"] == "Y") & (df["Sell"] == "Y"))

However, how can I filter for when a single column or both columns do not equal "Y"?? What is the code for each case? I have tried these codes, and they return no rows:

df.filter((df["Buy"] != "Y") & (df["Sell"] != "Y"))
df.filter(~((df["Buy"] == "Y") | (df["Sell"] == "Y")))

It seems it does not capture null values

Upvotes: 2

Views: 9554

Answers (4)

nilakantha singh deo
nilakantha singh deo

Reputation: 1006

You can always try with spark SQL by creating a temporary view and write queries naturally in SQL. Such as for this we can write

df.createOrReplaceTempView('filter_value_not_equal_to_Y')
filterNotEqual=spark.sql("Select * from filter_value_not_equal_to_Y where Sell <>'Y' or Buy <>'Y'")
display(filterNotEqual)

Upvotes: 0

MikA
MikA

Reputation: 5582

The easy and simple fix would be to use null safe operator

df.filter(~((df["Buy"].eqNullSafe("Y")) | (df["Sell"].eqNullSafe("Y"))))

Upvotes: 2

Lamanus
Lamanus

Reputation: 13591

Fill the null value first and do the filter.

df.na.fill('N') \
  .filter("Sell != 'Y' or Buy != 'Y'") \
  .show(10, False)

df.na.fill('N') \
  .filter("Sell != 'Y' and Buy != 'Y'") \
  .show(10, False)

+---+----+---+
|id |Sell|Buy|
+---+----+---+
|A  |N   |N  |
|C  |N   |Y  |
|D  |Y   |N  |
|E  |N   |N  |
+---+----+---+

+---+----+---+
|id |Sell|Buy|
+---+----+---+
|A  |N   |N  |
|E  |N   |N  |
+---+----+---+

Upvotes: 1

SCouto
SCouto

Reputation: 7926

I think the problem is that they are null, null values are somehow special

Try this to filter for values where Buy is not Y

df.filter((df["Buy"] != "Y") | (df["Buy"].isNull()))

So if you want to filter where Buy and Sell are not 'Y' as it seems by what you've tried, you need to do this:

 df.filter((df["Buy"] != "Y") | (df["Buy"].isNull()) & (df["Sell"] != "Y") | (df["Sell"].isNull()))

Quick example:

Input

+---+----+----+
| id|Sell| Buy|
+---+----+----+
|  A|null|null|
|  B|   Y|   Y|
|  C|   Y|null|
|  D|   Y|null|
|  E|null|null|
+---+----+----+

Output

>>> df.filter((df["Buy"] != "Y") | (df["Buy"].isNull())).show(10)
+---+----+----+
| id|Sell| Buy|
+---+----+----+
|  A|null|null|
|  C|   Y|null|
|  D|   Y|null|
|  E|null|null|
+---+----+----+

>>> df.filter((df["Buy"] != "Y") | (df["Buy"].isNull()) & (df["Sell"] != "Y") | (df["Sell"].isNull())).show(10)
+---+----+----+
| id|Sell| Buy|
+---+----+----+
|  A|null|null|
|  E|null|null|
+---+----+----+

Upvotes: 2

Related Questions