Reputation: 467
I want to check if any value in array:
list = ['dog', 'mouse', 'horse', 'bird']
Appears in PySpark dataframe column:
Text | isList |
---|---|
I like my two dogs | True |
I don't know if I want to have a cat | False |
Anna sings like a bird | True |
Horseland is a good place | True |
I found that in case of multiple words people tend to use dog|mouse|horse|bird
but I have many of them and I would like to use an array. Could you help me please?
Upvotes: 0
Views: 1274
Reputation: 32720
For Spark 3+, you can use any function. Create a lateral array from your list and explode it then groupby the text column and apply any :
from pyspark.sql import functions as F
df1 = df.withColumn(
"word",
F.explode(F.array(*[F.lit(w) for w in ['dog', 'mouse', 'horse', 'bird']]))
).groupBy("text").agg(
F.expr("any(lower(text) rlike word)").alias("isList")
)
df1.show(truncate=False)
#+------------------------------------+------+
#|text |isList|
#+------------------------------------+------+
#|I like my two dogs |true |
#|Anna sings like a bird |true |
#|I don't know if I want to have a cat|false |
#|Horseland is a good place |true |
#+------------------------------------+------+
The same with max
:
df1 = df.withColumn(
"word",
F.explode(F.array(*[F.lit(w) for w in ['dog', 'mouse', 'horse', 'bird']]))
).groupBy("text").agg(
F.max(F.expr("lower(text) rlike word")).alias("isList")
)
If you want to check exact match, you can use arrays_overlap
function:
words_expr = F.array(*[F.lit(w) for w in ['dog', 'mouse', 'horse', 'bird']])
df1 = df.withColumn(
'isList',
F.arrays_overlap(F.split("text", " "), words_expr)
)
Upvotes: 1
Reputation: 42422
If you want to use arrays, you'll need to transform
the array with an rlike
comparison:
import pyspark.sql.functions as F
word_list = ['dog', 'mouse', 'horse', 'bird']
df2 = df.withColumn(
'words',
F.array(*[F.lit(w) for w in word_list])
).withColumn(
'isList',
F.expr("array_max(transform(words, x -> lower(text) rlike x))")
).drop('words')
df2.show(20,0)
+------------------------------------+------+
|text |isList|
+------------------------------------+------+
|I like my two dogs |true |
|I don't know if I want to have a cat|false |
|Anna sings like a bird |true |
|Horseland is a good place |true |
+------------------------------------+------+
A filter
operation on the array is also possible, where you test the size of the filtered array (with matching words):
df2 = df.withColumn(
'words',
F.array(*[F.lit(w) for w in word_list])
).withColumn(
'isList',
F.expr("size(filter(words, x -> lower(text) rlike x)) > 0")
).drop('words')
If you fancy using aggregate
that's also possible:
df2 = df.withColumn(
'words',
F.array(*[F.lit(w) for w in word_list])
).withColumn(
'isList',
F.expr("aggregate(words, false, (acc, x) -> acc or lower(text) rlike x)")
).drop('words')
Note that all three of these higher order functions require Spark >= 2.4.
Upvotes: 1