bakun

Reputation: 475

Empty list representation in PySpark

I have a spark DataFrame with a column named "Ingredients". It has some values like:

['banana', 'apple']
['meat'] 
[]
[]

I want to keep only the rows where the array is empty (`[]`). I tried this:

display(df.filter(df.ingredients == []))

But got error:

java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList []

Upvotes: 2

Views: 4454

Answers (3)

geosmart

Reputation: 666

Try defining the column like this:

import pyspark.sql.functions as F
import pyspark.sql.types as T
df = df.withColumn("ids",F.lit(None).astype(T.ArrayType(T.StringType())))

The ids column is stored as None, and its dtype is array<string>.

Then you can query it with Spark SQL:

select *  from tb1 where ids is not null

Upvotes: 0

blackbishop

Reputation: 32670

Adding to @mck's answer: sometimes you have an array that contains only one empty string, and it is also displayed as an empty array. Here's an example:

import pyspark.sql.functions as F

df = spark.createDataFrame([([''],)], ['value'])

df.show()

# +-----+
# |value|
# +-----+
# |   []|
# +-----+

df.filter(F.col("value") == F.array(F.lit(""))).show()

# +-----+
# |value|
# +-----+
# |   []|
# +-----+

df.filter(F.col("value") != F.array(F.lit(""))).show()

# +-----+
# |value|
# +-----+
# +-----+

In this case F.col("value") == F.array() won't work.

Upvotes: 2

mck

Reputation: 42352

You can specify an empty array to compare:

import pyspark.sql.functions as F

display(df.filter(df.ingredients == F.array()))

Or you can check that the array length is zero:

display(df.filter(F.size(df.ingredients) == 0))

Upvotes: 3

Related Questions