Reputation: 35
There are two DataFrames:

df1:

word  count
abd   1
abc   2
ad    3

df2:

words          CountSum
[ad, as, abd]  0
[abc, ad]      0

and so on.
My main goal is to go through each value in the words column of df2, match it against the word column of df1, and, whenever a match is found, add the df1 count to df2's CountSum.

For example, in the first row of df2 above, ad and abd appear in df1.word, so CountSum = count(ad) + count(abd) = 3 + 1 = 4, and so on.
The final df2 should be:

words          CountSum
[ad, as, abd]  4
[abc, ad]      5
So far, I have a general idea that I need to join the two DataFrames and explode the array to get the required result, but I keep running into errors about mismatched types and "Column is not iterable".
Also, can this be done with built-in PySpark methods, or do I have to use a UDF?
Thanks in advance.
Upvotes: 0
Views: 456
Reputation: 15258
I would do something like this:

from pyspark.sql import functions as F

# Explode the array so each word gets its own row, join the per-word
# counts from df1, then group back on the original array and sum.
df2.withColumn("word", F.explode("words")).join(df1, how="left", on="word").groupBy(
    "words"
).agg(F.sum("count").alias("CountSum")).show()
+-------------+--------+
| words|CountSum|
+-------------+--------+
|[ad, as, abd]| 4|
| [abc, ad]| 5|
+-------------+--------+
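Note that F.sum ignores the nulls produced by the left join for words missing from df1 (such as as), which is exactly the behaviour you want here. One caveat: if df2 can contain two rows with the same words array, grouping on the array alone merges them into one output row. A sketch of a guard against that, tagging each row with a unique id before exploding (same df1/df2 assumed):

from pyspark.sql import functions as F

# Tag each row first so identical arrays stay separate output rows.
(
    df2.withColumn("id", F.monotonically_increasing_id())
    .withColumn("word", F.explode("words"))
    .join(df1, how="left", on="word")
    .groupBy("id", "words")
    .agg(F.sum("count").alias("CountSum"))
    .drop("id")
    .show()
)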
Upvotes: 0