CyborgDroid

Reputation: 237

Dataframe filtering with condition applied to list of columns

I want to filter rows out of a pyspark dataframe when any of the string columns in a list is empty.

df = df.where(all([col(x) != '' for x in col_list]))

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

Upvotes: 4

Views: 233

Answers (2)

Daniel

Reputation: 1242

Since filter (or where) is a lazily evaluated transformation, we can merge multiple conditions by applying them one by one, e.g.

from pyspark.sql.functions import col

for c in col_list:
    spark_df = spark_df.filter(col(c) != "")

spark_df.show()

This may be a bit more readable, but in the end it is executed in exactly the same way as Sreeram's answer.
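One way to check this is to compare the physical plans: Catalyst collapses consecutive filters into a single predicate. A quick sketch, assuming spark_df from above and illustrative column names "x" and "y":

from pyspark.sql.functions import col

# The optimizer merges the chained filters, so the plan shows one combined
# Filter node, the same as for a single condition built with &.
spark_df.filter(col("x") != "").filter(col("y") != "").explain()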

On a side note, removing rows with missing values is most often done with

df.na.drop(how="any", subset=col_list)

but it only handles missing (null / None) values, not empty strings.
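If one wanted to cover both cases at once, nulls and empty strings, the two checks could be combined explicitly. A minimal sketch, assuming df and col_list as above:

from functools import reduce
from pyspark.sql import functions as F

# Treat a column as present only if it is neither null nor an empty string.
conditions = (F.col(c).isNotNull() & (F.col(c) != "") for c in col_list)

# Chain the per-column conditions with &, as in Sreeram's answer.
df = df.where(reduce(lambda x, y: x & y, conditions))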

Upvotes: 0

Sreeram TP

Reputation: 11917

You can use reduce from functools to simulate all, like this:

from functools import reduce
from pyspark.sql import functions as F

spark_df.where(reduce(lambda x, y: x & y, (F.col(x) != '' for x in col_list))).show()
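As a usage note, the lambda could also be replaced with operator.and_. A small self-contained sketch; the example data and column names here are illustrative, not from the question:

from functools import reduce
from operator import and_
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative data: rows with an empty string in either column should be dropped.
spark_df = spark.createDataFrame([("a", "b"), ("c", ""), ("", "d")], ["x", "y"])
col_list = ["x", "y"]

# operator.and_ combines the per-column conditions exactly like the lambda above.
spark_df.where(reduce(and_, (F.col(c) != "" for c in col_list))).show()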

Upvotes: 2
