Python Pyspark - Text Analysis / Removing rows if word (value of row) is in a dictionary of stopwords

Question

I hope someone can help with a simple sentiment analysis task in PySpark. I have a PySpark DataFrame where each row contains a single word. I also have a collection of common stopwords (a Python set).

I want to remove the rows where the word (the value of the row) is in the stopwords set.

Input:

+-------+
|  word |
+-------+
|    the|
|   food|
|     is|
|amazing|
|    and|
|  great|
+-------+

stopwords = {'the', 'is', 'and'}

Expected Output:

+-------+
|  word |
+-------+
|   food|
|amazing|
|  great|
+-------+

vladsiv · Accepted Answer

Use a negated isin:

from pyspark.sql import functions as F

df = df.filter(~F.col("word").isin(stop_words))

where stop_words is the set of stopwords from the question:

stop_words = {"the", "is", "and"}

Result:

+-------+
|word   |
+-------+
|food   |
|amazing|
|great  |
+-------+
