Rakesh Adhikesavan

Reputation: 12826

PySpark: Compare array values in one DataFrame with array values in another DataFrame to get the intersection

I have the following two DataFrames:

l1 = [(['hello','world'],), (['stack','overflow'],), (['hello', 'alice'],), (['sample', 'text'],)]
df1 = spark.createDataFrame(l1)

l2 = [(['big','world'],), (['sample','overflow', 'alice', 'text', 'bob'],), (['hello', 'sample'],)]
df2 = spark.createDataFrame(l2) 
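(Each DataFrame ends up with a single array column, named _1 by default; a quick check, exact output may vary slightly by Spark version:)

df1.printSchema()
# root
#  |-- _1: array (nullable = true)
#  |    |-- element: string (containsNull = true)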

df1:

["hello","world"]
["stack","overflow"]
["hello","alice"]
["sample","text"]

df2:

["big","world"]
["sample","overflow","alice","text","bob"]
["hello", "sample"]

For every row in df1, I want to intersect its array with every row of df2 and sum the lengths of those intersections.

For example, the first row in df1 is ["hello","world"]. Intersecting ["hello","world"] with every row in df2 gives:

| ARRAY                                      | INTERSECTION | LEN(INTERSECTION) |
| ["big","world"]                            | ["world"]    | 1                 |
| ["sample","overflow","alice","text","bob"] | []           | 0                 |
| ["hello","sample"]                         | ["hello"]    | 1                 |
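For illustration, on Spark 2.4+ this per-row table can be computed directly with a brute-force cross join using array_intersect and size (a sketch of the desired semantics, not an efficient solution):

from pyspark.sql.functions import array_intersect, size

# Pair every row of df1 with every row of df2, then compute each
# pair's intersection and its length.
pairs = (df1.toDF("arr1")
    .crossJoin(df2.toDF("arr2"))
    .withColumn("intersection", array_intersect("arr1", "arr2"))
    .withColumn("len", size("intersection")))
pairs.show(truncate=False)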

Now, I want to return the sum(len(intersection)). Ultimately, I want the resulting df1 to look like this:

df1 result:

| ARRAY                | INTERSECTION_TOTAL |
| ["hello","world"]    | 2                  |
| ["stack","overflow"] | 1                  |
| ["hello","alice"]    | 2                  |
| ["sample","text"]    | 3                  |

How do I solve this?

Upvotes: 4

Views: 2753

Answers (1)

Alper t. Turker

Reputation: 35229

I'd focus on avoiding a Cartesian product first. I'd try to explode and join:

from pyspark.sql.functions import explode, monotonically_increasing_id

# Name the array column, tag each row with a unique id, and
# explode the array into one word per output row.
df1_ = (df1.toDF("words")
    .withColumn("id_1", monotonically_increasing_id())
    .select("*", explode("words").alias("word")))

df2_ = (df2.toDF("words")
    .withColumn("id_2", monotonically_increasing_id())
    .select("id_2", explode("words").alias("word")))

# Join on the exploded word and count matches per (df1 row, df2 row)
# pair, then sum those per-pair counts for each df1 row.
(df1_.join(df2_, "word").groupBy("id_1", "id_2", "words").count()
    .groupBy("id_1", "words").sum("count").drop("id_1").show())
+-----------------+----------+                                                  
|            words|sum(count)|
+-----------------+----------+
|   [hello, alice]|         2|
|   [sample, text]|         3|
|[stack, overflow]|         1|
|   [hello, world]|         2|
+-----------------+----------+

If intermediate values are not needed, it could be simplified to:

df1_.join(df2_, "word").groupBy("words").count().show()
+-----------------+-----+                                                       
|            words|count|
+-----------------+-----+
|   [hello, alice]|    2|
|   [sample, text]|    3|
|[stack, overflow]|    1|
|   [hello, world]|    2|
+-----------------+-----+

and you could omit adding ids.
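A minimal sketch of that id-free variant (same names as above; note it would merge duplicate arrays in df1, which the ids otherwise keep distinct):

from pyspark.sql.functions import explode

# Explode both frames without row ids and count word matches per array.
df1_ = df1.toDF("words").select("words", explode("words").alias("word"))
df2_ = df2.toDF("words").select(explode("words").alias("word"))

df1_.join(df2_, "word").groupBy("words").count().show()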

Upvotes: 1
