Norah Jones

Reputation: 467

PySpark: Check if value in array is in column

I want to check if any value in this array:

list = ['dog', 'mouse', 'horse', 'bird']

appears in a PySpark DataFrame column:

Text                                  isList
I like my two dogs                    True
I don't know if I want to have a cat  False
Anna sings like a bird                True
Horseland is a good place             True

I found that, for multiple words, people tend to use a regex pattern like dog|mouse|horse|bird, but I have many words and would like to use an array instead. Could you help me, please?

Upvotes: 0

Views: 1274

Answers (2)

blackbishop

Reputation: 32720

For Spark 3+, you can use the any aggregate function: create a literal array from your list, explode it, then group by the text column and apply any:

from pyspark.sql import functions as F

df1 = df.withColumn(
    "word",
    F.explode(F.array(*[F.lit(w) for w in ['dog', 'mouse', 'horse', 'bird']]))
).groupBy("text").agg(
    F.expr("any(lower(text) rlike word)").alias("isList")
)

df1.show(truncate=False)
#+------------------------------------+------+
#|text                                |isList|
#+------------------------------------+------+
#|I like my two dogs                  |true  |
#|Anna sings like a bird              |true  |
#|I don't know if I want to have a cat|false |
#|Horseland is a good place           |true  |
#+------------------------------------+------+
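As a plain-Python sanity check (not part of the Spark job), the any(lower(text) rlike word) logic above amounts to an any() over regex searches per text; texts below is a hypothetical re-creation of the sample data:

```python
import re

words = ['dog', 'mouse', 'horse', 'bird']
texts = [
    "I like my two dogs",
    "I don't know if I want to have a cat",
    "Anna sings like a bird",
    "Horseland is a good place",
]

# For each text, True if any word matches as a regex against the lowercased
# text, mirroring Spark's any(lower(text) rlike word) after explode/groupBy.
is_list = {t: any(re.search(w, t.lower()) is not None for w in words)
           for t in texts}
```

Note that this matches substrings, so "dogs" and "Horseland" still count as hits for "dog" and "horse".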

The same with max:

df1 = df.withColumn(
    "word",
    F.explode(F.array(*[F.lit(w) for w in ['dog', 'mouse', 'horse', 'bird']]))
).groupBy("text").agg(
    F.max(F.expr("lower(text) rlike word")).alias("isList")
)

If you want to check for an exact match, you can use the arrays_overlap function:

words_expr = F.array(*[F.lit(w) for w in ['dog', 'mouse', 'horse', 'bird']])

df1 = df.withColumn(
    'isList',
    F.arrays_overlap(F.split("text", " "), words_expr)
)
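To see how exact matching differs from the rlike approaches, here is a plain-Python sketch of arrays_overlap's semantics (an illustration, not Spark code): the text is split on spaces and the resulting tokens are intersected with the word list, so "dogs" no longer matches "dog":

```python
words = {'dog', 'mouse', 'horse', 'bird'}

def overlaps(text):
    # arrays_overlap is true when the two arrays share at least one element;
    # here that means a whole token from text must equal one of the words.
    return bool(words & set(text.split(" ")))
```

Under these semantics "I like my two dogs" is False, because the token "dogs" is not exactly "dog", while "Anna sings like a bird" stays True.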

Upvotes: 1

mck

Reputation: 42422

If you want to use arrays, you'll need to transform the array with an rlike comparison:

import pyspark.sql.functions as F

word_list = ['dog', 'mouse', 'horse', 'bird']

df2 = df.withColumn(
    'words',
    F.array(*[F.lit(w) for w in word_list])
).withColumn(
    'isList',
    F.expr("array_max(transform(words, x -> lower(text) rlike x))")
).drop('words')

df2.show(truncate=False)
+------------------------------------+------+
|text                                |isList|
+------------------------------------+------+
|I like my two dogs                  |true  |
|I don't know if I want to have a cat|false |
|Anna sings like a bird              |true  |
|Horseland is a good place           |true  |
+------------------------------------+------+

A filter operation on the array is also possible: keep only the matching words and test the size of the filtered array:

df2 = df.withColumn(
    'words',
    F.array(*[F.lit(w) for w in word_list])
).withColumn(
    'isList',
    F.expr("size(filter(words, x -> lower(text) rlike x)) > 0")
).drop('words')

If you fancy using aggregate, that's also possible:

df2 = df.withColumn(
    'words',
    F.array(*[F.lit(w) for w in word_list])
).withColumn(
    'isList',
    F.expr("aggregate(words, false, (acc, x) -> acc or lower(text) rlike x)")
).drop('words')

Note that all three of these higher order functions require Spark >= 2.4.
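The aggregate call is a left fold over the array. As a rough plain-Python analogue (an illustration, not Spark code), the same logic can be written with functools.reduce:

```python
import re
from functools import reduce

word_list = ['dog', 'mouse', 'horse', 'bird']

def is_list(text):
    # Fold over the words, OR-ing in one regex match per word, just like
    # aggregate(words, false, (acc, x) -> acc or lower(text) rlike x).
    return reduce(lambda acc, w: acc or re.search(w, text.lower()) is not None,
                  word_list, False)
```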

Upvotes: 1
