Reputation: 35
There are two DataFrames:

df1:

word  count
abd   1
abc   2
ad    3

df2:

words          CountSum
[ad, as, abd]  0
[abc, ad]      0

and so on.
My main goal is to go through each value in the words column of df2, match it against the word column of df1, and, whenever a match is found, add the df1 count to df2's CountSum.

For example, in the first row of df2 above, ad and abd appear in df1.word, so CountSum = count(ad) + count(abd) = 3 + 1 = 4, and so on.
The final df2 should be:

words          CountSum
[ad, as, abd]  4
[abc, ad]      5
So far, I have a general idea that I need to join the two DataFrames and explode the array to get the required result, but I keep running into errors about mismatched types and "Column is not iterable".
Also, can this be done with built-in PySpark methods, or do I have to use a UDF?
Thanks in advance.
Upvotes: 0
Views: 456
Reputation: 15258
I would do something like this:

from pyspark.sql import functions as F

# Explode the array so each word gets its own row, join the per-word
# counts from df1, then group back on the original array and sum.
df2.withColumn("word", F.explode("words")).join(df1, how="left", on="word").groupBy(
    "words"
).agg(F.sum("count").alias("CountSum")).show()
+-------------+--------+
| words|CountSum|
+-------------+--------+
|[ad, as, abd]| 4|
| [abc, ad]| 5|
+-------------+--------+
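Note that F.sum ignores the nulls produced by the left join for words missing from df1 (such as as), which is exactly the behaviour you want here. One caveat: if df2 can contain two rows with the same words array, grouping on the array alone merges them into one output row. A sketch of a guard against that, tagging each row with a unique id before exploding (same df1/df2 assumed):

from pyspark.sql import functions as F

# Tag each row first so identical arrays stay separate output rows.
(
    df2.withColumn("id", F.monotonically_increasing_id())
    .withColumn("word", F.explode("words"))
    .join(df1, how="left", on="word")
    .groupBy("id", "words")
    .agg(F.sum("count").alias("CountSum"))
    .drop("id")
    .show()
)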
Upvotes: 0