Reputation: 3536
I have a DataFrame that contains a lot of repeated values. An aggregated, distinct count of it looks like this:
> df.groupby('fruits').count().sort(F.desc('count')).show()
| fruits         | count |
| -------------- | ----- |
| [Apples]       | 123   |
| []             | 344   |
| [Apples, plum] | 444   |
My goal is to filter all rows where the value is either [Apples] or [].
Surprisingly, the following works for a non-empty array, but not for an empty one:
import pyspark.sql.functions as F
import pyspark.sql.types as T

is_apples = F.udf(lambda arr: arr == ['Apples'], T.BooleanType())
df.filter(is_apples(df.fruits)).count()  # WORKS! Shows 123 correctly.

is_empty = F.udf(lambda arr: arr == [], T.BooleanType())
df.filter(is_empty(df.fruits)).count()  # Doesn't work! Should show 344 but shows zero.
Any idea what I am doing wrong?
Upvotes: 3
Views: 3702
Reputation: 6748
You can do it by checking the length of the array.
import pyspark.sql.types as T
import pyspark.sql.functions as F
is_empty = F.udf(lambda arr: len(arr) == 0, T.BooleanType())
df.filter(is_empty(df.fruits)).count()
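If the column can also be null (not just an empty array), len(arr) would raise an error inside the UDF for those rows. A defensive variant (a sketch based on that assumption about the data, not part of the original answer):

import pyspark.sql.functions as F
import pyspark.sql.types as T

# Sketch: also treat a null array column as "empty"
is_empty = F.udf(lambda arr: arr is None or len(arr) == 0, T.BooleanType())
df.filter(is_empty(df.fruits)).count()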
Upvotes: 3
Reputation: 11
If you don't want to use a UDF, you can use F.size to get the size of the array.
To filter empty array:
df.filter(F.size(df.fruits) == 0)
To filter non-empty array:
df.filter(F.size(df.fruits) != 0)
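If the end goal from the question is to keep both the [] rows and the [Apples] rows in one pass, the size check can be combined with an array comparison without a UDF. A sketch building on this answer (the F.array(F.lit('Apples')) comparison is my own addition, not part of the original answer):

import pyspark.sql.functions as F

# Sketch: keep rows whose array is empty OR equals ['Apples']
df.filter(
    (F.size("fruits") == 0) |
    (F.col("fruits") == F.array(F.lit("Apples")))
).count()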
Upvotes: 1
Reputation: 42342
It might be an array containing an empty string:
is_empty = F.udf(lambda arr: arr == [''], T.BooleanType())
Or it might be an array of null:
is_empty = F.udf(lambda arr: arr == [None], T.BooleanType())
To check them all at once you can use:
is_empty = F.udf(lambda arr: arr in [[], [''], [None]], T.BooleanType())
But actually you don't need a UDF for this, e.g. you can do:
df.filter("fruits = array() or fruits = array('') or fruits = array(null)")
Upvotes: 3