gaatjeniksaan

Reputation: 1431

Pyspark filter out empty lists using .filter()

I have a pyspark dataframe where one column contains lists, some with entries and some empty. I want to efficiently filter out all rows whose list is empty.

import pyspark.sql.functions as sf
df.filter(sf.col('column_with_lists') != []) 

returns the following error:

Py4JJavaError: An error occurred while calling o303.notEqual.
: java.lang.RuntimeException: Unsupported literal type class

Perhaps I can check the length of the list and require that it be > 0 (see here). However, I am unsure how this syntax works with pyspark-sql, and whether filter even accepts a lambda.

To clarify: I have multiple columns but want to apply the above filter to a single one, removing the offending rows. The linked SO example filters on a single column.

Thanks in advance!

Upvotes: 8

Views: 13672

Answers (1)

gaatjeniksaan

Reputation: 1431

So it appears it is as simple as using the size function from pyspark.sql.functions:

import pyspark.sql.functions as sf
df.filter(sf.size('column_with_lists') > 0)
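
For context, here is a minimal, self-contained sketch of the same approach; the sample data and the id column are made up for illustration:

import pyspark.sql.functions as sf
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: some rows have entries, one has an empty list
df = spark.createDataFrame(
    [(1, ['a', 'b']), (2, []), (3, ['c'])],
    ['id', 'column_with_lists'],
)

# sf.size returns the number of elements in the array column,
# so rows whose list is empty are dropped by the > 0 check
df.filter(sf.size('column_with_lists') > 0).show()
# +---+-----------------+
# | id|column_with_lists|
# +---+-----------------+
# |  1|           [a, b]|
# |  3|              [c]|
# +---+-----------------+

Note that, depending on your Spark version, sf.size on a null value returns -1 or null, so rows where the column is null (rather than an empty list) are also removed by the > 0 check.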

Upvotes: 26
