Reputation: 63
I hope someone can help with a simple sentiment analysis in PySpark. I have a PySpark DataFrame where each row contains a word. I also have a set of common stopwords.
I want to remove the rows where the word (the value of the row) is in the stopwords set.
Input:
+-------+
| word |
+-------+
| the|
| food|
| is|
|amazing|
| and|
| great|
+-------+
stopwords = {'the', 'is', 'and'}
Expected Output:
+-------+
| word |
+-------+
| food|
|amazing|
| great|
+-------+
Upvotes: 0
Views: 235
Reputation: 32690
You can create a DataFrame from the set of stopwords, then join it with the input DataFrame using a left_anti join, which keeps only the rows of input_df that have no match in stopwords_df:
stopwords_df = spark.createDataFrame([[w] for w in stopwords], ["word"])
result_df = input_df.join(stopwords_df, ["word"], "left_anti")
result_df.show()
#+-------+
#| word|
#+-------+
#|amazing|
#| food|
#| great|
#+-------+
Upvotes: 1
Reputation: 2946
Use a negated isin:
from pyspark.sql import functions as F
df = df.filter(~F.col("word").isin(stop_words))
where stop_words is:
stop_words = {"the", "is", "and"}
Result:
+-------+
|word |
+-------+
|food |
|amazing|
|great |
+-------+
Upvotes: 2