Maarten

Reputation: 239

Pyspark: filter dataframe based on column name list

I am looking for a clean way to tackle the following problem: I want to filter a dataframe to only contain rows that have at least one value filled in for a pre-defined list of column names.

What I do not want to do is list all the conditions separately, like so:

df.where(col('col1').isNotNull() | col('col2').isNotNull() | ...)

This approach is not feasible because the list of column names could get quite large and might change often.

The solution I have now is to compose the condition in a separate function:

from pyspark.sql.functions import col, lit

def compose_condition(col_names):
    # Start from a literal False and OR in one isNotNull() check per column.
    condition = lit(False)
    for col_name in col_names:
        condition = condition | col(col_name).isNotNull()
    return condition

And then use the result of this function as the filter condition:

my_col_names = ['col1', 'col2', 'col3']
df.where(compose_condition(my_col_names))

Are there any alternatives to my solution?

Upvotes: 1

Views: 1677

Answers (2)

blackbishop

Reputation: 32640

You can also use Python's functools.reduce to fold the list of column names into a single filter condition:

from functools import reduce
from pyspark.sql.functions import col

df.where(reduce(lambda a, b: a | b, [col(c).isNotNull() for c in my_col_names]))
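
For completeness, here is a minimal, self-contained sketch of this approach; it assumes a local SparkSession and uses made-up sample data (the column names and values are illustrative only):

from functools import reduce
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Toy data: the row with id=2 has a value in col1; the row with id=1 is null in all three columns.
df = spark.createDataFrame(
    [(1, None, None, None), (2, "a", None, None)],
    schema="id int, col1 string, col2 string, col3 string",
)

my_col_names = ["col1", "col2", "col3"]

# Fold the per-column isNotNull() checks into one OR'ed condition.
condition = reduce(lambda a, b: a | b, [col(c).isNotNull() for c in my_col_names])
df.where(condition).show()
# Only the row with id=2 is kept, since it has at least one non-null value in the subset.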

Upvotes: 2

mck

Reputation: 42332

You can use dropna with how='all' to remove rows where every column in the specified subset is null:

df2 = df.dropna(how='all', subset=my_col_names)
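
As a quick way to see the behaviour, here is a minimal sketch with made-up sample data (again assuming a local SparkSession; the names and values are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, None, None, None), (2, "a", None, None)],
    schema="id int, col1 string, col2 string, col3 string",
)

my_col_names = ["col1", "col2", "col3"]

# Drop rows only when col1, col2 and col3 are all null.
df2 = df.dropna(how='all', subset=my_col_names)
df2.show()
# Only the row with id=2 remains; the all-null row is dropped.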

Upvotes: 1
