Reputation: 4521
When I use spaCy to identify stop words, it doesn't work if I load the en_core_web_lg model, but it does work when I use en_core_web_sm. Is this a bug, or am I doing something wrong?
import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp(u'The cat ran over the hill and to my lap')
for word in doc:
    print(f' {word} | {word.is_stop}')
Result:
The | False
cat | False
ran | False
over | False
the | False
hill | False
and | False
to | False
my | False
lap | False
However, when I change this line to load the en_core_web_sm model, I get different results:
nlp = spacy.load('en_core_web_sm')
The | False
cat | False
ran | False
over | True
the | True
hill | False
and | True
to | True
my | True
lap | False
Upvotes: 1
Views: 4908
Reputation: 61910
The issue you have is a documented bug. The suggested workaround is the following:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
nlp = spacy.load('en_core_web_lg')
# Manually flag every default stop word on the loaded model's vocab
for word in STOP_WORDS:
    for w in (word, word[0].capitalize(), word.upper()):
        lex = nlp.vocab[w]
        lex.is_stop = True
doc = nlp(u'The cat ran over the hill and to my lap')
for word in doc:
    print('{} | {}'.format(word, word.is_stop))
Output
The | False
cat | False
ran | False
over | True
the | True
hill | False
and | True
to | True
my | True
lap | False
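As a quick sanity check after running the loop above (assuming the same nlp object is still loaded), the is_stop flag can also be read straight off the vocab; the two words here are just illustrative picks from the example sentence:

print(nlp.vocab['the'].is_stop)   # True: flagged by the workaround loop
print(nlp.vocab['hill'].is_stop)  # False: 'hill' is not a stop word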
Upvotes: 2
Reputation: 13106
Try from spacy.lang.en.stop_words import STOP_WORDS, then you can explicitly check whether each word is in that set:
from spacy.lang.en.stop_words import STOP_WORDS
import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp(u'The cat ran over the hill and to my lap')
for word in doc:
    # Have to convert Token type to String, otherwise types won't match
    print(f' {word} | {str(word) in STOP_WORDS}')
Outputs the following:
The | False
cat | False
ran | False
over | True
the | True
hill | False
and | True
to | True
my | True
lap | False
Looks like a bug to me. However, this approach also gives you the flexibility of adding words to the STOP_WORDS set if you need to.
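For example, a minimal sketch of extending the set; the extra word 'cat' is just an illustration, not one of spaCy's default stop words:

from spacy.lang.en.stop_words import STOP_WORDS

# STOP_WORDS is a plain Python set, so custom entries can be added directly.
STOP_WORDS.add('cat')

print('cat' in STOP_WORDS)  # True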
Upvotes: 0