Reputation: 53
I have run into a problem using sklearn's CountVectorizer with a document that consists of the single word 'one'. As far as I can tell, the error occurs when the document contains only words with the POS tag CD (cardinal number). The following documents all result in an empty vocabulary error: ['one', 'two'] and ['hundred'].
from sklearn.feature_extraction.text import CountVectorizer
ngram_code = 1
cv = CountVectorizer(stop_words='english', analyzer='word', lowercase=True,
                     token_pattern=r"[\w']+", ngram_range=(ngram_code, ngram_code))
cv_array = cv.fit_transform(['one', 'two'])
This raises: ValueError: empty vocabulary; perhaps the documents only contain stop words
The following does NOT result in an error because (I think) the cardinal number words are mixed with other words: ['one', 'two', 'people']
Interestingly, though, only 'people' is added to the vocabulary in this case; 'one' and 'two' are not:
cv_array = cv.fit_transform(['one', 'two', 'people'])
cv.vocabulary_
Out[143]: {'people': 0}
As another example of a single-word document, ['hello'] works just fine because it is not a cardinal number:
cv_array = cv.fit_transform(['hello'])
cv.vocabulary_
Out[147]: {'hello': 0}
Since words like 'one' and 'two' are not stop words, I would like them to be processed by CountVectorizer. How can I work with these words?
Addition: I also got the same error for the word 'system'. Why does it fail on this word?
cv_array = cv.fit_transform(['system'])
ValueError: empty vocabulary; perhaps the documents only contain stop words
Upvotes: 5
Views: 11004
Reputation: 40973
The reason you are getting an empty vocabulary is that those words belong to the list of English stop words sklearn uses, so stop_words='english' removes them before the vocabulary is built. You can inspect the list, or test membership directly:
>>> from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
>>> 'one' in ENGLISH_STOP_WORDS
True
>>> 'two' in ENGLISH_STOP_WORDS
True
>>> 'system' in ENGLISH_STOP_WORDS
True
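To check a whole document rather than one word at a time, a small helper can tokenize the text and report which tokens are on the built-in list. This is only a sketch: `filtered_tokens` is a hypothetical name, not part of sklearn, and it assumes the same `token_pattern` as in the question.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS


def filtered_tokens(doc):
    """Return the tokens of `doc` that stop_words='english' would drop."""
    # Build an analyzer without stop-word filtering so every token is visible.
    # The token_pattern is copied from the question (an assumption here).
    analyzer = CountVectorizer(stop_words=None, lowercase=True,
                               token_pattern=r"[\w']+").build_analyzer()
    return [tok for tok in analyzer(doc) if tok in ENGLISH_STOP_WORDS]


dropped = filtered_tokens("One hundred people, one system")
print(dropped)
```

If every token of every document shows up in the result, fitting with stop_words='english' leaves nothing for the vocabulary, which is what triggers the ValueError.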
If you want to process those words, initialize your CountVectorizer without stop-word filtering, like this:
cv = CountVectorizer(stop_words=None, ...
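As a fuller sketch of that suggestion, the snippet below fills in the remaining arguments with the values from the question (an assumption about what the elided arguments should be). It also shows one possible alternative that the answer does not mention: passing a custom list that keeps the built-in stop words except the ones you want to retain. The `keep` set is purely illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Option 1: turn stop-word filtering off entirely.
cv_all = CountVectorizer(stop_words=None, analyzer='word', lowercase=True,
                         token_pattern=r"[\w']+", ngram_range=(1, 1))
cv_all.fit(['one', 'two'])
vocab_all = cv_all.vocabulary_

# Option 2 (an alternative, not from the answer): keep the built-in list
# but remove the words that should survive. `keep` is a hypothetical choice.
keep = {'one', 'two', 'hundred', 'system'}
custom_stop_words = sorted(ENGLISH_STOP_WORDS - keep)
cv_custom = CountVectorizer(stop_words=custom_stop_words, analyzer='word',
                            lowercase=True, token_pattern=r"[\w']+")
cv_custom.fit(['one', 'two', 'the', 'system'])
vocab_custom = cv_custom.vocabulary_

print(sorted(vocab_all))
print(sorted(vocab_custom))
```

Option 1 is the simplest fix; Option 2 may be preferable if you still want common function words such as 'the' filtered out.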
Upvotes: 2