Norah Jones

Reputation: 467

PySpark: Check if value in array is in column

I want to check if any value in this array:

list = ['dog', 'mouse', 'horse', 'bird']

appears in a PySpark DataFrame column:

Text                                  isList
I like my two dogs                    True
I don't know if I want to have a cat  False
Anna sings like a bird                True
Horseland is a good place             True

I found that, for multiple words, people tend to use a regex pattern like dog|mouse|horse|bird, but I have many words and would like to use an array instead. Could you help me, please?

Upvotes: 0

Views: 1274

Answers (2)

blackbishop

Reputation: 32720

For Spark 3+, you can use the any aggregate function: create a literal array from your list, explode it, then group by the text column and apply any:

from pyspark.sql import functions as F

df1 = df.withColumn(
    "word",
    F.explode(F.array(*[F.lit(w) for w in ['dog', 'mouse', 'horse', 'bird']]))
).groupBy("text").agg(
    F.expr("any(lower(text) rlike word)").alias("isList")
)

df1.show(truncate=False)
#+------------------------------------+------+
#|text                                |isList|
#+------------------------------------+------+
#|I like my two dogs                  |true  |
#|Anna sings like a bird              |true  |
#|I don't know if I want to have a cat|false |
#|Horseland is a good place           |true  |
#+------------------------------------+------+
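As a plain-Python sanity check (not part of the Spark job), the any(lower(text) rlike word) logic above amounts to an any() over regex searches per text; texts below is a hypothetical re-creation of the sample data:

```python
import re

words = ['dog', 'mouse', 'horse', 'bird']
texts = [
    "I like my two dogs",
    "I don't know if I want to have a cat",
    "Anna sings like a bird",
    "Horseland is a good place",
]

# For each text, True if any word matches as a regex against the lowercased
# text, mirroring Spark's any(lower(text) rlike word) after explode/groupBy.
is_list = {t: any(re.search(w, t.lower()) is not None for w in words)
           for t in texts}
```

Note that this matches substrings, so "dogs" and "Horseland" still count as hits for "dog" and "horse".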

The same with max:

df1 = df.withColumn(
    "word",
    F.explode(F.array(*[F.lit(w) for w in ['dog', 'mouse', 'horse', 'bird']]))
).groupBy("text").agg(
    F.max(F.expr("lower(text) rlike word")).alias("isList")
)

If you want to check for an exact match, you can use the arrays_overlap function:

words_expr = F.array(*[F.lit(w) for w in ['dog', 'mouse', 'horse', 'bird']])

df1 = df.withColumn(
    'isList',
    F.arrays_overlap(F.split("text", " "), words_expr)
)
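To see how exact matching differs from the rlike approaches, here is a plain-Python sketch of arrays_overlap's semantics (an illustration, not Spark code): the text is split on spaces and the resulting tokens are intersected with the word list, so "dogs" no longer matches "dog":

```python
words = {'dog', 'mouse', 'horse', 'bird'}

def overlaps(text):
    # arrays_overlap is true when the two arrays share at least one element;
    # here that means a whole token from text must equal one of the words.
    return bool(words & set(text.split(" ")))
```

Under these semantics "I like my two dogs" is False, because the token "dogs" is not exactly "dog", while "Anna sings like a bird" stays True.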

Upvotes: 1

mck

Reputation: 42422

If you want to use arrays, you'll need to transform the array with an rlike comparison:

import pyspark.sql.functions as F

word_list = ['dog', 'mouse', 'horse', 'bird']

df2 = df.withColumn(
    'words',
    F.array(*[F.lit(w) for w in word_list])
).withColumn(
    'isList',
    F.expr("array_max(transform(words, x -> lower(text) rlike x))")
).drop('words')

df2.show(truncate=False)
+------------------------------------+------+
|text                                |isList|
+------------------------------------+------+
|I like my two dogs                  |true  |
|I don't know if I want to have a cat|false |
|Anna sings like a bird              |true  |
|Horseland is a good place           |true  |
+------------------------------------+------+

A filter operation on the array is also possible: keep only the matching words and test the size of the filtered array:

df2 = df.withColumn(
    'words',
    F.array(*[F.lit(w) for w in word_list])
).withColumn(
    'isList',
    F.expr("size(filter(words, x -> lower(text) rlike x)) > 0")
).drop('words')

If you fancy using aggregate, that's also possible:

df2 = df.withColumn(
    'words',
    F.array(*[F.lit(w) for w in word_list])
).withColumn(
    'isList',
    F.expr("aggregate(words, false, (acc, x) -> acc or lower(text) rlike x)")
).drop('words')

Note that all three of these higher order functions require Spark >= 2.4.
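The aggregate call is a left fold over the array. As a rough plain-Python analogue (an illustration, not Spark code), the same logic can be written with functools.reduce:

```python
import re
from functools import reduce

word_list = ['dog', 'mouse', 'horse', 'bird']

def is_list(text):
    # Fold over the words, OR-ing in one regex match per word, just like
    # aggregate(words, false, (acc, x) -> acc or lower(text) rlike x).
    return reduce(lambda acc, w: acc or re.search(w, text.lower()) is not None,
                  word_list, False)
```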

Upvotes: 1
