NekoJel

Reputation: 63

Python PySpark - Text Analysis / Removing rows if the word (value of the row) is in a set of stopwords

I hope someone can help with a simple sentiment analysis in PySpark. I have a PySpark dataframe where each row contains a single word. I also have a set of common stopwords.

I want to remove the rows where the word (the value of the row) is in the stopwords set.

Input:

+-------+
|  word |
+-------+
|    the|
|   food|
|     is|
|amazing|
|    and|
|  great|
+-------+

stopwords = {'the', 'is', 'and'}

Expected Output:

+-------+
|  word |
+-------+
|   food|
|amazing|
|  great|
+-------+

Upvotes: 0

Views: 235

Answers (2)

blackbishop

Reputation: 32690

You can create a dataframe from the set of stopwords, then join it with the input dataframe using a left_anti join, which keeps only the rows that have no match in the stopwords dataframe:

stopwords_df = spark.createDataFrame([[w] for w in stopwords], ["word"])

result_df = input_df.join(stopwords_df, ["word"], "left_anti")

result_df.show()
#+-------+
#|   word|
#+-------+
#|amazing|
#|   food|
#|  great|
#+-------+

Upvotes: 1

vladsiv

Reputation: 2946

Use a negated isin (F is pyspark.sql.functions):

from pyspark.sql import functions as F
df = df.filter(~F.col("word").isin(stop_words))

where stop_words is:

stop_words = {"the", "is", "and"}

Result:

+-------+
|word   |
+-------+
|food   |
|amazing|
|great  |
+-------+

Upvotes: 2
