Reputation: 3
I have two dataframes, a and b:
a:
+-----+---------+
| word|frequency|
+-----+---------+
|  git|        5|
|stack|       10|
|match|       15|
|other|        3|
+-----+---------+
b:
+-------------+---------+
|         word|frequency|
+-------------+---------+
|       github|        5|
|        match|        2|
|stackoverflow|       10|
|      b_entry|        7|
+-------------+---------+
I would like to filter out all rows from dataframe a where a.word is equal to, or a substring of, the word value of any row in b, so the desired output is:
+-----+---------+
| word|frequency|
+-----+---------+
|other|        3|
+-----+---------+
I know there are functions such as a.word.contains(), a.word.like(), a.word.rlike(), etc. that can test whether a.word contains a given substring. The issue with these is that I would end up with the rows of b that contain values of a.word, instead of the rows of a whose word is contained in b.word. Are there any functions that help test whether a.word is a substring of a value in another table?
Ideally, a solution would operate on the two tables together instead of iterating directly over the values, e.g. a SQL join. A sketch reproducing the setup follows below.
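For reference, here is a minimal sketch that builds the two example DataFrames shown above (assuming an active SparkSession named spark):
# Build the sample data; column names match the tables above
a = spark.createDataFrame(
    [("git", 5), ("stack", 10), ("match", 15), ("other", 3)],
    ["word", "frequency"],
)
b = spark.createDataFrame(
    [("github", 5), ("match", 2), ("stackoverflow", 10), ("b_entry", 7)],
    ["word", "frequency"],
)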
Upvotes: 0
Views: 1378
Reputation: 3419
Use contains() in the join condition with left_anti as the join type. A left anti join returns all rows from the first (left) table that have no match in the second (right) table.
df_a.show()
+-----+---------+
| word|frequency|
+-----+---------+
|  git|        5|
|stack|       10|
|match|       15|
|other|        3|
+-----+---------+
df_b.show()
+-------------+-----------+
|       word_1|frequency_1|
+-------------+-----------+
|       github|          5|
|        match|          2|
|stackoverflow|         10|
|      b_entry|          7|
+-------------+-----------+
# .contains() is a Column method, so no extra imports are needed.
# The anti join keeps only the rows of df_a whose word is not a substring
# of any word_1 in df_b.
df_a.join(df_b, df_b.word_1.contains(df_a.word), "left_anti").show()
+-----+---------+
| word|frequency|
+-----+---------+
|other|        3|
+-----+---------+
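Since the question mentions a SQL join, the same anti join can also be written in Spark SQL once the DataFrames are registered as temporary views (a sketch, assuming a SparkSession named spark; the view names a and b are arbitrary):
df_a.createOrReplaceTempView("a")
df_b.createOrReplaceTempView("b")

# instr(b.word_1, a.word) > 0 is true when a.word is a substring of b.word_1;
# LEFT ANTI JOIN then keeps only the rows of a with no such match in b
spark.sql("""
    SELECT a.*
    FROM a
    LEFT ANTI JOIN b
      ON instr(b.word_1, a.word) > 0
""").show()
Note that contains() and instr() do literal substring matching; if you need pattern matching instead, rlike() can be used in the join condition in the same way.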
Upvotes: 1