Prakhar

Reputation: 3536

Filtering a column with an empty array in Pyspark

I have a DataFrame which contains a lot of repeated values. An aggregated, distinct count of it looks like this:

> df.groupby('fruits').count().sort(F.desc('count')).show()


| fruits        | count       |
| -----------   | ----------- |
| [Apples]      | 123         |
| []            | 344         |
| [Apples, plum]| 444         |

My goal is to filter all rows where the value is either [Apples] or [].

Surprisingly, the following works for a non-empty array, but for the empty one it doesn't:

import pyspark.sql.functions as F
import pyspark.sql.types as T

is_apples = F.udf(lambda arr: arr == ['Apples'], T.BooleanType())
df.filter(is_apples(df.fruits)).count()  # WORKS! Shows 123 correctly.

is_empty = F.udf(lambda arr: arr == [], T.BooleanType())
df.filter(is_empty(df.fruits)).count()  # Doesn't work! Should show 344 but shows zero.

Any idea what I am doing wrong?
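One way to narrow it down is to inspect what those "empty" rows actually contain. A minimal diagnostic sketch (the column name is taken from the question; the rest is an assumption):

import pyspark.sql.functions as F

# Group by the element count of the array alongside the raw value,
# to see whether "[]" is truly empty or holds '' / null elements.
df.groupBy(F.size('fruits').alias('size'), 'fruits').count().show(truncate=False)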

Upvotes: 3

Views: 3702

Answers (3)

Equinox

Reputation: 6748

You can do it by checking the length of the array.

import pyspark.sql.types as T
import pyspark.sql.functions as F

is_empty = F.udf(lambda arr: len(arr) == 0, T.BooleanType())
df.filter(is_empty(df.fruits)).count()

Upvotes: 3

Quang Nguyen

Reputation: 11

If you don't want to use a UDF, you can use F.size to get the size of the array.

To filter empty array:

df.filter(F.size(df.fruits) == 0)

To filter non-empty array:

df.filter(F.size(df.fruits) != 0)
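Applied to the original goal (rows whose value is either [Apples] or []), a UDF-free filter might look like this; a sketch, assuming fruits is an array-of-strings column:

import pyspark.sql.functions as F

# Keep rows whose array is empty or equals exactly ["Apples"], then count them.
df.filter(
    (F.size(df.fruits) == 0) | (df.fruits == F.array(F.lit('Apples')))
).count()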

Upvotes: 1

mck

Reputation: 42342

It might be an array containing an empty string:

is_empty = F.udf(lambda arr: arr == [''], T.BooleanType())

Or it might be an array of null:

is_empty = F.udf(lambda arr: arr == [None], T.BooleanType())

To check them all at once you can use:

is_empty = F.udf(lambda arr: arr in [[], [''], [None]], T.BooleanType())

But actually you don't need a UDF for this, e.g. you can do:

df.filter("fruits = array() or fruits = array('') or fruits = array(null)")

Upvotes: 3
