Reputation: 25
I want to count how many words from a given list occur in each row of a PySpark array column.
ls=['aa','bb']
df
| item|
|:---- |
| ['aa','bb']|
| ['aa','bc']|
| ['ad','bc']|
If this is the DataFrame, the output should be:
| item| count|
|:---- |:------:|
| ['aa','bb']| 2|
| ['aa','bc']| 1|
| ['ad','bc']| 0|
Upvotes: 1
Views: 356
Reputation: 4189
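This can be achieved by slightly modifying the definition of the ls variable: wrap it in a one-row DataFrame, pair that row with every row of the data, and take the size of the intersection of the two arrays.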
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
ls = [  # the word list, wrapped as a single row with one array column
    (['aa', 'bb'],),
]
data = [
    (['aa', 'bb'],),
    (['aa', 'bc'],),
    (['ad', 'bc'],),
]
df1 = spark.createDataFrame(ls, ['words'])  # one row holding the word list
df2 = spark.createDataFrame(data, ['word_arr'])
df = df2.crossJoin(df1).select(  # attach the word list to every data row
    'word_arr',
    F.size(F.array_intersect('word_arr', 'words')).alias('count'),  # distinct shared words
)
df.show(truncate=False)
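One detail worth checking against the requirement: array_intersect returns the common elements without duplicates, so the count above is the number of distinct matching words per row, not the number of matching elements. The plain-Python sketch below (not Spark code; the helper names count_distinct_matches and count_total_occurrences are made up for illustration) models both interpretations so the difference is easy to see. For the sample data in the question, the two agree.

```python
from typing import Iterable, List


def count_distinct_matches(items: Iterable[str], words: Iterable[str]) -> int:
    """Model of size(array_intersect(items, words)): each shared word counts once."""
    return len(set(items) & set(words))


def count_total_occurrences(items: Iterable[str], words: Iterable[str]) -> int:
    """Count every element of items that is in words, duplicates included."""
    wanted = set(words)
    return sum(1 for item in items if item in wanted)


ls: List[str] = ['aa', 'bb']
rows: List[List[str]] = [['aa', 'bb'], ['aa', 'bc'], ['ad', 'bc']]

# Sample data from the question: both interpretations give the same counts.
distinct_counts = [count_distinct_matches(row, ls) for row in rows]
total_counts = [count_total_occurrences(row, ls) for row in rows]
print(distinct_counts)  # [2, 1, 0]
print(total_counts)     # [2, 1, 0]

# A row with a repeated word (hypothetical, not in the question) shows the difference.
repeated = ['aa', 'aa', 'bb']
print(count_distinct_matches(repeated, ls))   # 2
print(count_total_occurrences(repeated, ls))  # 3
```

If the distinct count is what you need, the join can also be avoided by building the word list as a literal array column, for example F.array(*[F.lit(w) for w in ['aa', 'bb']]), and passing that as the second argument to array_intersect.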
Upvotes: 2