Reza Mousavi

Reputation: 21

How to remove stop words and lemmatize at the same time when using spaCy?

When I use spaCy for cleaning data, I run the following line:

df['text'] = df.sentence.progress_apply(lambda text: " ".join(token.lemma_ for token in nlp(text) if not token.is_stop and token.is_alpha))

This lemmatizes each word in the text row if the word is not a stop word. The problem is that token.lemma_ is applied only after the token is checked against the stop-word list. Therefore, if a stop word appears in a non-lemmatized form, it is not recognized as a stop word. For example, if I add "friend" to the list of stop words, the output will still contain "friend" when the original token was "friends". The easy solution is to run this line twice, but that sounds silly. Can anyone suggest a way to remove stop words that are not in lemmatized form in a single pass?

Thanks!

Upvotes: 1

Views: 4557

Answers (1)

Wiktor Stribiżew

Reputation: 626826

You can simply check whether token.lemma_ is present in nlp.Defaults.stop_words:

if token.lemma_.lower() not in nlp.Defaults.stop_words

For example:

df['text'] = df.sentence.progress_apply(
    lambda text: 
        " ".join(
            token.lemma_ for token in nlp(text)
                if token.lemma_.lower() not in nlp.Defaults.stop_words and token.is_alpha
        )
)

See a quick test:

>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")

>>> nlp.Defaults.stop_words.add("friend") # Adding "friend" to stopword list

>>> text = "I have a lot of friends"
>>> " ".join(token.lemma_ for token in nlp(text) if not token.is_stop and token.is_alpha)
'lot friend'

>>> " ".join(token.lemma_ for token in nlp(text) if token.lemma_.lower() not in nlp.Defaults.stop_words and token.is_alpha)
'lot'

If you add words in uppercase to the stop-word list, lowercase both sides of the comparison. Build the lowercased set once, rather than writing map(str.lower, nlp.Defaults.stop_words) inline in the condition (a bare map object is an iterator and is exhausted after the first token): stop_lower = set(map(str.lower, nlp.Defaults.stop_words)), then check if token.lemma_.lower() not in stop_lower.
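The reason the extra lowering step matters is that Python set membership is case-sensitive. A minimal sketch of just that logic, with a hypothetical stop-word list and no spaCy dependency:

```python
# Hypothetical stop-word set containing a mixed-case entry,
# as might happen if words are added in uppercase.
stop_words = {"the", "a", "FRIEND"}

lemmas = ["friend", "lot", "the"]

# Case-sensitive check misses "FRIEND", so "friend" survives:
kept_naive = [w for w in lemmas if w.lower() not in stop_words]
# kept_naive == ["friend", "lot"]

# Lowercasing the stop-word set once catches it:
stop_lower = set(map(str.lower, stop_words))
kept = [w for w in lemmas if w.lower() not in stop_lower]
# kept == ["lot"]
```

Materializing stop_lower as a set keeps membership tests O(1) and avoids re-lowering the whole stop-word list on every token.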

Upvotes: 2
