Reputation: 3
I have two dataframes, a and b:
a:
+-----+---------+
| word|frequency|
+-----+---------+
|  git|        5|
|stack|       10|
|match|       15|
|other|        3|
+-----+---------+
b:
+-------------+---------+
|         word|frequency|
+-------------+---------+
|       github|        5|
|        match|        2|
|stackoverflow|       10|
|      b_entry|        7|
+-------------+---------+
I would like to filter out all rows from dataframe a where a.word is equal to, or a substring of, the word value of any row in b, so the desired output is:
+-----+---------+
| word|frequency|
+-----+---------+
|other|        3|
+-----+---------+
I know there are functions such as a.word.contains(), a.word.like(), a.word.rlike(), etc. that can test whether a.word contains a given substring. The issue with these is that I would end up with the rows of b that contain values of a.word, instead of the rows of a whose word is contained in b.word. Are there any functions that help test whether a.word is a substring of a value in another table?
Ideally, a solution would operate on the two tables together instead of iterating directly over the values, e.g. a SQL join. A sketch reproducing the setup follows below.
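For reference, here is a minimal sketch that builds the two example DataFrames shown above (assuming an active SparkSession named spark):
# Build the sample data; column names match the tables above
a = spark.createDataFrame(
    [("git", 5), ("stack", 10), ("match", 15), ("other", 3)],
    ["word", "frequency"],
)
b = spark.createDataFrame(
    [("github", 5), ("match", 2), ("stackoverflow", 10), ("b_entry", 7)],
    ["word", "frequency"],
)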
Upvotes: 0
Views: 1378
Reputation: 3419
Use contains() in the join condition with left_anti as the join type. A left anti join returns all rows from the first (left) table that have no match in the second (right) table.
df_a.show()
+-----+---------+
| word|frequency|
+-----+---------+
|  git|        5|
|stack|       10|
|match|       15|
|other|        3|
+-----+---------+
df_b.show()
+-------------+-----------+
|       word_1|frequency_1|
+-------------+-----------+
|       github|          5|
|        match|          2|
|stackoverflow|         10|
|      b_entry|          7|
+-------------+-----------+
# .contains() is a Column method, so no extra imports are needed.
# The anti join keeps only the rows of df_a whose word is not a substring
# of any word_1 in df_b.
df_a.join(df_b, df_b.word_1.contains(df_a.word), "left_anti").show()
+-----+---------+
| word|frequency|
+-----+---------+
|other|        3|
+-----+---------+
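Since the question mentions a SQL join, the same anti join can also be written in Spark SQL once the DataFrames are registered as temporary views (a sketch, assuming a SparkSession named spark; the view names a and b are arbitrary):
df_a.createOrReplaceTempView("a")
df_b.createOrReplaceTempView("b")

# instr(b.word_1, a.word) > 0 is true when a.word is a substring of b.word_1;
# LEFT ANTI JOIN then keeps only the rows of a with no such match in b
spark.sql("""
    SELECT a.*
    FROM a
    LEFT ANTI JOIN b
      ON instr(b.word_1, a.word) > 0
""").show()
Note that contains() and instr() do literal substring matching; if you need pattern matching instead, rlike() can be used in the join condition in the same way.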
Upvotes: 1