Reputation: 605
New to PySpark, I'd like to remove some French stopwords from a PySpark column.
Due to some constraints, I can't use NLTK or spaCy; StopWordsRemover is the only option I have.
Below is what I have tried so far, without success:
from pyspark.ml import *
from pyspark.ml.feature import *

stop = ['EARL ', 'EIRL ', 'EURL ', 'SARL ', 'SA ', 'SAS ', 'SASU ', 'SCI ', 'SCM ', 'SCP ']
stop = [l.lower() for l in stop]

model = Pipeline(stages=[
    Tokenizer(inputCol="name", outputCol="token"),
    StopWordsRemover(inputCol="token", outputCol="stop", stopWords=stop),
]).fit(df)

result = model.transform(df)
Here is the expected output:
|name         |stop         |
|-------------|-------------|
|2A           |2A           |
|AZEJADE      |AZEJADE      |
|MONAZTESANTOS|MONAZTESANTOS|
|SCI SANTOS   |SANTOS       |
|SA FCB       |FCB          |
Upvotes: 1
Views: 3961
Reputation: 1109
To remove the stopwords from the dataframe, I tried a join-and-filter approach:
from pyspark.sql.functions import col, explode, split, lower, length

# Explode the titles into individual lowercased words and count occurrences
word_df = clean_df \
    .withColumn('words', explode(split(col('course_title'), ' '))) \
    .withColumn('lowerCaseWords', lower(col("words"))) \
    .groupBy('lowerCaseWords') \
    .count()

# Load the stopword list and lowercase it to match
stopwords_df = spark \
    .read \
    .option("header", False) \
    .csv("/FileStore/tables/standard/stopwords.csv") \
    .withColumn("stopword", lower(col("_c0")))

# Left join words against stopwords; non-matches keep a null stopword column
join_word_df = word_df \
    .join(stopwords_df, word_df["lowerCaseWords"] == stopwords_df["stopword"], "left")

# Keep only words with no stopword match, drop empty and single-character
# tokens, and show the counts in descending order (display() is Databricks-specific)
final_wordcount_df = join_word_df \
    .filter(col("stopword").isNull()) \
    .filter(length(col("lowerCaseWords")) != 1) \
    .filter(length(col("lowerCaseWords")) != 0) \
    .drop("stopword", "_c0") \
    .orderBy(col("count").desc()) \
    .display()
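As a side note, the left join plus isNull filter can be collapsed into a single anti join, which keeps only the left-side rows with no match on the right. A minimal sketch of that variant, reusing word_df and stopwords_df from above:

from pyspark.sql.functions import col, length

# left_anti keeps only words with no matching stopword and returns only
# the left-side columns, so the isNull filter and the drop are unnecessary
no_stop_df = word_df.join(
    stopwords_df,
    word_df["lowerCaseWords"] == stopwords_df["stopword"],
    "left_anti",
)

no_stop_df \
    .filter(length(col("lowerCaseWords")) > 1) \
    .orderBy(col("count").desc()) \
    .show()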
Upvotes: 0
Reputation: 32690
The problem is that you have trailing spaces in your stop words. Also, you don't need to lowercase them unless you need the StopWordsRemover to be case sensitive: by default caseSensitive is set to false, and you can change that through that parameter.
Note that when you use Tokenizer, the output will be in lowercase. If you need the output in the same case as the input column name, it is preferable to simply split the name column on whitespace.
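For illustration, a minimal sketch of a case-sensitive remover (assuming a tokens column as in the snippet below); with it, 'SCI' would be removed but 'sci' kept:

from pyspark.ml.feature import StopWordsRemover

# With caseSensitive=True the remover matches stop words exactly as written,
# so lowercasing the list beforehand would then change the behaviour
remover = StopWordsRemover(
    stopWords=['EARL', 'EIRL', 'EURL', 'SARL', 'SA', 'SAS', 'SASU', 'SCI', 'SCM', 'SCP'],
    inputCol="tokens",
    outputCol="stop",
    caseSensitive=True,
)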
Try this:
from pyspark.ml.feature import StopWordsRemover
import pyspark.sql.functions as F

stop = ['EARL', 'EIRL', 'EURL', 'SARL', 'SA', 'SAS', 'SASU', 'SCI', 'SCM', 'SCP']

df = spark.createDataFrame([("2A",), ("AZEJADE",), ("MONAZTESANTOS",), ("SCI SANTOS",), ("SA FCB",)], ["name"])

# Split on whitespace instead of using Tokenizer, to preserve the original case
df = df.withColumn("tokens", F.split("name", "\\s+"))

remover = StopWordsRemover(stopWords=stop, inputCol="tokens", outputCol="stop")

# Join the remaining tokens back into a single string
result = remover.transform(df).select("name", F.array_join("stop", " ").alias("stop"))
result.show()
#+-------------+-------------+
#| name| stop|
#+-------------+-------------+
#| 2A| 2A|
#| AZEJADE| AZEJADE|
#|MONAZTESANTOS|MONAZTESANTOS|
#| SCI SANTOS| SANTOS|
#| SA FCB| FCB|
#+-------------+-------------+
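If you also need standard French stopwords on top of these custom terms, StopWordsRemover ships a built-in French list that you can combine with your own. A short sketch:

from pyspark.ml.feature import StopWordsRemover

# Built-in French stopword list, extended with the custom company-form terms
french_stop = StopWordsRemover.loadDefaultStopWords("french")
remover = StopWordsRemover(stopWords=french_stop + stop, inputCol="tokens", outputCol="stop")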
Upvotes: 3