0vbb

Reputation: 903

Pyspark: filter dataframe based on list with many conditions

Suppose you have a pyspark dataframe df with columns A and B.
Now, you want to filter the dataframe with many conditions.

The conditions are contained in a list of dicts:

l = [{'A': 'val1', 'B': 5}, {'A': 'val4', 'B': 2}, ...]

The filtering should be done as follows:

df.filter(
    ((df['A'] == l[0]['A']) & (df['B'] == l[0]['B']))
    &
    ((df['A'] == l[1]['A']) & (df['B'] == l[1]['B']))
    &
    ...
)

How can this be done when l contains many conditions, i.e. when writing the filter condition out by hand is not practical?

I thought about using separate filter steps, i.e.:

for d in l:
    df = df.filter((df['A'] == d['A']) & (df['B'] == d['B']))

Is there a shorter or more elegant way of doing this, e.g. something similar to a list comprehension?
In addition, this approach does not work for ORs (|).
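
For reference, a sketch of one way to get the list-comprehension style (an assumption about the intent, not part of the original question): build one Column per dict and fold them together with functools.reduce, using | or & depending on the logic needed:

from functools import reduce
from pyspark.sql import functions as F

# Build one condition Column per dict, then OR them all together.
conds = [(F.col('A') == d['A']) & (F.col('B') == d['B']) for d in l]
df.filter(reduce(lambda a, b: a | b, conds)).show()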

Upvotes: 3

Views: 1961

Answers (2)

E.Serra

Reputation: 1574

I don't think you want to be messing with SQL strings, as that can end up in a big mess (strings need to be surrounded by quotes, None converted to null, NaN handled, etc.).

You want to do something similar to this (myrow is your row, and columns is the set of columns you want to match that row on; if you want all columns, use columns = df.columns):

filtered = df
for col in columns:
    # None in the row must be matched with isNull(); equality against null never matches.
    if myrow.asDict()[col] is None:
        filtered = filtered.filter(filtered[col].isNull())
    else:
        filtered = filtered.filter(filtered[col].isin([myrow[col]]))
filtered.show()

Note that you should add another check for NaN, but this should find the row you want and preserve the type translation between Python and Spark without you worrying about it (besides NaN and null, that is, which are not comparable; two things being not-a-number doesn't mean they are the same).
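
For reference, a sketch of what that extra NaN check could look like inside the loop (an assumption, not part of the original answer; F.isnan only applies to float/double columns):

import math
from pyspark.sql import functions as F

val = myrow.asDict()[col]
if val is None:
    filtered = filtered.filter(filtered[col].isNull())
elif isinstance(val, float) and math.isnan(val):
    # NaN never compares equal to anything, itself included, so test it explicitly.
    filtered = filtered.filter(F.isnan(filtered[col]))
else:
    filtered = filtered.filter(filtered[col].isin([val]))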

As for filtering out certain rows, which is my use case, things get hairy because all the conditions apply at the same time (not what is happening here), but I hope this helps.

Upvotes: 0

murtihash

Reputation: 8410

You could use your list of dictionaries to build a SQL expression and send it to your filter all at once.

l = [{'A': 'val1', 'B': 5}, {'A': 'val4', 'B': 2}]
df.show()

#+----+---+
#|   A|  B|
#+----+---+
#|val1|  5|
#|val1|  1|
#|val1|  3|
#|val4|  2|
#|val1|  4|
#|val1|  1|
#+----+---+

df.filter(' or '.join([f"(A = '{d['A']}' and B = {d['B']})" for d in l])).show()

#+----+---+
#|   A|  B|
#+----+---+
#|val1|  5|
#|val4|  2|
#+----+---+
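
For clarity (not part of the original answer), the string the comprehension builds for the sample list looks like this:

expr = ' or '.join([f"(A = '{d['A']}' and B = {d['B']})" for d in l])
print(expr)
#(A = 'val1' and B = 5) or (A = 'val4' and B = 2)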

Upvotes: 3
