CountVectorizer gives empty vocabulary error if document is a cardinal number

Question

I have run into a problem using sklearn's CountVectorizer with a document that consists of a single word, 'one'. As far as I can tell, the error occurs when the document consists only of words with the POS tag CD (cardinal number). The following documents all result in an empty vocabulary error: ['one', 'two'] and ['hundred']

from sklearn.feature_extraction.text import CountVectorizer

ngram_code = 1
cv = CountVectorizer(stop_words='english', analyzer='word', lowercase=True,
                     token_pattern=r"[\w']+", ngram_range=(ngram_code, ngram_code))
cv_array = cv.fit_transform(['one', 'two'])

I get the error: ValueError: empty vocabulary; perhaps the documents only contain stop words

The following does NOT result in an error because (I think) the cardinal number words are mixed with other words: ['one', 'two', 'people']

Interestingly though, in this case, only 'people' is added to the vocabulary, 'one', 'two' are not added:

cv_array = cv.fit_transform(['one', 'two', 'people'])
cv.vocabulary_
Out[143]: {'people': 0}

As another example of a single word document, ['hello'] works just fine because it is not a cardinal number:

cv_array = cv.fit_transform(['hello'])
cv.vocabulary_
Out[147]: {'hello': 0}

Since words like 'one' and 'two' are not stop words, I would like CountVectorizer to process them. How can I work with these words?

Addition: I also got the same error for the word 'system'. Why does it fail on this word?

cv_array = cv.fit_transform(['system'])

ValueError: empty vocabulary; perhaps the documents only contain stop words

elyase · Accepted Answer

The reason you are getting an empty vocabulary is that those words belong to the list of stop words sklearn uses. You can check the list here or test with:

>>> from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

>>> 'one' in ENGLISH_STOP_WORDS 
True

>>> 'two' in ENGLISH_STOP_WORDS 
True

>>> 'system' in ENGLISH_STOP_WORDS 
True
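To see the same thing from the vectorizer's side, here is a minimal sketch that builds the analyzer CountVectorizer would use and prints which tokens survive for each document. The document list and the variable names (`docs`, `surviving`) are only illustrative; the settings mirror the ones in the question.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same stop-word and tokenizer settings as in the question.
cv = CountVectorizer(stop_words='english', analyzer='word', lowercase=True,
                     token_pattern=r"[\w']+")

# build_analyzer() returns the callable that tokenizes, lowercases and
# filters stop words; it does not require the vectorizer to be fitted.
analyze = cv.build_analyzer()

docs = ['one', 'two', 'people', 'hello']  # illustrative documents
surviving = {doc: analyze(doc) for doc in docs}

for doc, tokens in surviving.items():
    print(repr(doc), '->', tokens)
```

A document whose token list comes back empty contributes nothing to the vocabulary, so a corpus made up only of such documents triggers the empty vocabulary error.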

If you want to process those words just init your CountVectorizer like this:

cv = CountVectorizer(stop_words=None, ...
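As a rough sketch of how this could look end to end: the first option below follows the suggestion above and turns off stop-word filtering entirely. The second option is not part of the answer; it is an assumption about what you might want, namely keeping the built-in list but removing the words you care about. The `keep` set and the sample documents are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

token_pattern = r"[\w']+"  # same pattern as in the question

# Option 1: disable stop-word filtering, so 'one' and 'two' are kept.
cv_all = CountVectorizer(stop_words=None, token_pattern=token_pattern)
cv_all.fit(['one', 'two'])
vocab_all = cv_all.vocabulary_

# Option 2 (assumption, not from the answer): keep the built-in English
# list but drop the words that should be counted. `keep` is hypothetical.
keep = {'one', 'two', 'hundred', 'system'}
custom_stop_words = sorted(ENGLISH_STOP_WORDS - keep)  # stop_words expects a list

cv_custom = CountVectorizer(stop_words=custom_stop_words,
                            token_pattern=token_pattern)
cv_custom.fit(['one', 'two', 'the system'])
vocab_custom = cv_custom.vocabulary_

print(vocab_all)
print(sorted(vocab_custom))
```

With the second option, common words such as 'the' are still filtered out, while the words listed in `keep` make it into the vocabulary.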
