Reputation: 21
When I use spaCy for cleaning data, I run the following line:
df['text'] = df.sentence.progress_apply(lambda text: " ".join(token.lemma_ for token in nlp(text) if not token.is_stop and token.is_alpha))
This lemmatizes each word in the row's text if the word is not a stop word. The problem is that token.is_stop is checked on the original token, and token.lemma_ is only applied afterwards. Therefore, if a token is not in its lemmatized form, it will not be matched against a stop word that is listed only as a lemma. For example, if I add "friend" to the list of stop words, the output will still contain "friend" if the original token was "friends". The easy solution is to run this line twice, but that seems silly. Can anyone suggest a way to remove stop words that appear in a non-lemmatized form in the first run?
Thanks!
Upvotes: 1
Views: 4557
Reputation: 626826
You can simply check whether token.lemma_ is present in nlp.Defaults.stop_words:
if token.lemma_.lower() not in nlp.Defaults.stop_words
For example:
df['text'] = df.sentence.progress_apply(
    lambda text:
        " ".join(
            token.lemma_ for token in nlp(text)
            if token.lemma_.lower() not in nlp.Defaults.stop_words and token.is_alpha
        )
)
See a quick test:
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> nlp.Defaults.stop_words.add("friend") # Adding "friend" to stopword list
>>> text = "I have a lot of friends"
>>> " ".join(token.lemma_ for token in nlp(text) if not token.is_stop and token.is_alpha)
'lot friend'
>>> " ".join(token.lemma_ for token in nlp(text) if token.lemma_.lower() not in nlp.Defaults.stop_words and token.is_alpha)
'lot'
If you add words containing uppercase letters to the stopword list, you will need to lowercase the list entries as well: if token.lemma_.lower() not in map(str.lower, nlp.Defaults.stop_words).
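Since the map(str.lower, ...) check walks the whole stopword list for every token, it may be worth lowercasing the list once up front. Below is a minimal sketch of that idea. The helper names normalize_stop_words and clean_lemmas are hypothetical (not part of spaCy), and the demo uses stand-in token objects exposing only lemma_ and is_alpha so that it runs without a model installed:

```python
from types import SimpleNamespace  # only needed for the spaCy-free demo below


def normalize_stop_words(stop_words):
    """Hypothetical helper: lowercase the stop words once into a frozenset."""
    return frozenset(word.lower() for word in stop_words)


def clean_lemmas(tokens, stop_words_lower):
    """Hypothetical helper: join lemmas of alphabetic, non-stop-word tokens.

    `tokens` is any iterable of objects exposing `lemma_` and `is_alpha`,
    for example a spaCy Doc. The stop-word test is done on the lowercased
    lemma, not on the original token text.
    """
    return " ".join(
        token.lemma_
        for token in tokens
        if token.is_alpha and token.lemma_.lower() not in stop_words_lower
    )


# Stand-in for nlp("I have a lot of friends"); the lemmas are written by hand.
fake_doc = [
    SimpleNamespace(lemma_="I", is_alpha=True),
    SimpleNamespace(lemma_="have", is_alpha=True),
    SimpleNamespace(lemma_="a", is_alpha=True),
    SimpleNamespace(lemma_="lot", is_alpha=True),
    SimpleNamespace(lemma_="of", is_alpha=True),
    SimpleNamespace(lemma_="friend", is_alpha=True),
]

# Mixed-case entries are fine because the set is lowercased once.
stop_words_lower = normalize_stop_words({"I", "have", "a", "of", "Friend"})
print(clean_lemmas(fake_doc, stop_words_lower))
```

With a real pipeline, you would presumably build the set as normalize_stop_words(nlp.Defaults.stop_words) and call df.sentence.progress_apply(lambda text: clean_lemmas(nlp(text), stop_words_lower)). The set has to be rebuilt if you add stop words later.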
Upvotes: 2