Reputation: 239
I am looking for a clean way to tackle the following problem: I want to filter a dataframe down to rows that have a value filled in for at least one of a pre-defined list of columns.
What I do not want to do is list all the conditions separately, like so:
df.where(col('col1').isNotNull() | col('col2').isNotNull() | ...)
This approach is not feasible because the list of column names could get quite large and might change often.
The solution I have now is to compose the condition in a separate function:
from pyspark.sql.functions import col

def compose_condition(col_names):
    # OR together an isNotNull check for each column in the list
    condition = False
    for col_name in col_names:
        condition = condition | col(col_name).isNotNull()
    return condition
And then use the result of this function as the filter condition:
my_col_names = ['col1', 'col2', 'col3']
df.where(compose_condition(my_col_names))
Are there any alternatives to my solution?
Upvotes: 1
Views: 1677
Reputation: 32640
You can also use Python's functools.reduce to combine the isNotNull conditions for the listed column names into a single filter condition:
from functools import reduce
from pyspark.sql.functions import col

# Map each column name to an isNotNull condition, then OR the conditions together
df.where(reduce(lambda a, b: a | b, [col(c).isNotNull() for c in my_col_names]))
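If you prefer to avoid the lambda, the same reduction can be written with operator.or_. This is just an equivalent sketch of the approach above, reusing df and my_col_names from the question:

from functools import reduce
from operator import or_
from pyspark.sql.functions import col

# Build one isNotNull condition per column, then OR them into a single predicate
condition = reduce(or_, (col(c).isNotNull() for c in my_col_names))
df.where(condition)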
Upvotes: 2
Reputation: 42332
You can use dropna with how='all' and a subset to remove rows in which all of the listed columns are null:
df2 = df.dropna(how='all', subset=my_col_names)
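For illustration, here is a minimal, self-contained sketch with made-up data (the column names and rows are hypothetical); only the row in which all three subset columns are null gets dropped:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical example data; col1..col3 are the columns we care about
df = spark.createDataFrame(
    [(1, "a", None, None), (2, None, "b", None), (3, None, None, None)],
    schema="id int, col1 string, col2 string, col3 string",
)

my_col_names = ["col1", "col2", "col3"]

# Keeps rows 1 and 2; drops row 3 because col1, col2 and col3 are all null there
df2 = df.dropna(how="all", subset=my_col_names)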
Upvotes: 1