Maarten

Reputation: 239

Pyspark: filter dataframe based on column name list

I am looking for a clean way to tackle the following problem: I want to filter a dataframe to only contain rows that have at least one value filled in for a pre-defined list of column names.

What I do not want to do is list all the conditions separately, like so:

df.where(col('col1').isNotNull() | col('col2').isNotNull() | ...)

This approach is not feasible because the list of column names could get quite large and might change often.

The solution I have now is to compose the condition in a separate function:

from pyspark.sql.functions import col, lit

def compose_condition(col_names):
    # Start from a literal False and OR in one isNotNull() check per column.
    condition = lit(False)
    for col_name in col_names:
        condition = condition | col(col_name).isNotNull()
    return condition

And then use the result of this function as the filter condition:

my_col_names = ['col1', 'col2', 'col3']
df.where(compose_condition(my_col_names))

Are there any alternatives to my solution?

Upvotes: 1

Views: 1677

Answers (2)

blackbishop

Reputation: 32640

You can also use Python's functools.reduce to fold the list of column names into a single filter condition:

from functools import reduce
from pyspark.sql.functions import col

df.where(reduce(lambda a, b: a | b, [col(c).isNotNull() for c in my_col_names]))
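
For completeness, here is a minimal, self-contained sketch of this approach; it assumes a local SparkSession and uses made-up sample data (the column names and values are illustrative only):

from functools import reduce
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Toy data: the row with id=2 has a value in col1; the row with id=1 is null in all three columns.
df = spark.createDataFrame(
    [(1, None, None, None), (2, "a", None, None)],
    schema="id int, col1 string, col2 string, col3 string",
)

my_col_names = ["col1", "col2", "col3"]

# Fold the per-column isNotNull() checks into one OR'ed condition.
condition = reduce(lambda a, b: a | b, [col(c).isNotNull() for c in my_col_names])
df.where(condition).show()
# Only the row with id=2 is kept, since it has at least one non-null value in the subset.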

Upvotes: 2

mck

Reputation: 42332

You can use dropna with how='all' to remove rows where every column in the specified subset is null:

df2 = df.dropna(how='all', subset=my_col_names)
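
As a quick way to see the behaviour, here is a minimal sketch with made-up sample data (again assuming a local SparkSession; the names and values are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, None, None, None), (2, "a", None, None)],
    schema="id int, col1 string, col2 string, col3 string",
)

my_col_names = ["col1", "col2", "col3"]

# Drop rows only when col1, col2 and col3 are all null.
df2 = df.dropna(how='all', subset=my_col_names)
df2.show()
# Only the row with id=2 remains; the all-null row is dropped.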

Upvotes: 1
