Guddi

Reputation: 53

CountVectorizer gives empty vocabulary error if document is a cardinal number

I have run into a problem using sklearn's CountVectorizer with a document that consists of a single word, 'one'. As far as I can tell, the error occurs when the document consists only of words whose POS tag is CD (cardinal number). The following documents all result in an empty vocabulary error: ['one', 'two'] and ['hundred'].

from sklearn.feature_extraction.text import CountVectorizer

ngram_code = 1
cv = CountVectorizer(stop_words='english', analyzer='word', lowercase=True,
                     token_pattern=r"[\w']+", ngram_range=(ngram_code, ngram_code))
cv_array = cv.fit_transform(['one', 'two'])

This raises: ValueError: empty vocabulary; perhaps the documents only contain stop words

The following does NOT result in error because (I think) the cardinal number words are mixed with other words: ['one', 'two', 'people']

Interestingly, though, in this case only 'people' is added to the vocabulary; 'one' and 'two' are not:

cv_array = cv.fit_transform(['one', 'two', 'people'])
cv.vocabulary_
Out[143]: {'people': 0}

As another example of a single word document, ['hello'] works just fine because it is not a cardinal number:

cv_array = cv.fit_transform(['hello'])
cv.vocabulary_
Out[147]: {'hello': 0}

Since words like 'one' and 'two' are not stop words, I would like them to be processed by CountVectorizer. How can I work with these words?

Addition: I also got the same error for the word 'system'. Why does it fail on this word?

cv_array = cv.fit_transform(['system'])

ValueError: empty vocabulary; perhaps the documents only contain stop words

Upvotes: 5

Views: 11004

Answers (1)

elyase

Reputation: 40973

The reason you are getting an empty vocabulary is that those words belong to the list of stop words sklearn uses when stop_words='english'. You can check the list here or test with:

>>> from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

>>> 'one' in ENGLISH_STOP_WORDS 
True

>>> 'two' in ENGLISH_STOP_WORDS 
True

>>> 'system' in ENGLISH_STOP_WORDS 
True
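To see the effect per document rather than per word, here is a minimal sketch that runs the vectorizer's analyzer directly. It assumes the same stop_words and token_pattern settings as in the question; the `docs` list and the `surviving` name are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same stop-word and token settings as in the question (assumed here).
cv = CountVectorizer(stop_words='english', token_pattern=r"[\w']+")

# build_analyzer() returns the callable that lowercases, tokenizes
# and drops stop words, without needing to fit the vectorizer first.
analyze = cv.build_analyzer()

docs = ['one', 'two', 'system', 'people', 'hello']  # illustrative inputs
surviving = {doc: analyze(doc) for doc in docs}

for doc, tokens in surviving.items():
    # An empty list means every token in the document was a stop word.
    print(doc, '->', tokens)
```

A document whose token list comes back empty contributes nothing to the vocabulary, which is why fitting on only such documents raises the error.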

If you want to process those words, just initialize your CountVectorizer like this:

cv = CountVectorizer(stop_words=None, ...
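As a rough sketch of two ways to do this: either turn stop-word filtering off entirely, or keep the built-in list but take out the words you care about. The `keep` set and the variable names below are hypothetical, and the token_pattern is copied from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

docs = ['one', 'two', 'system']

# Option 1: disable stop-word filtering entirely.
cv_all = CountVectorizer(stop_words=None, token_pattern=r"[\w']+")
cv_all.fit(docs)
vocab_all = sorted(cv_all.vocabulary_)

# Option 2: keep the built-in English list, minus the words to retain.
keep = {'one', 'two', 'system'}  # hypothetical choice of words to keep
custom_stop_words = list(ENGLISH_STOP_WORDS - keep)
cv_custom = CountVectorizer(stop_words=custom_stop_words,
                            token_pattern=r"[\w']+")
cv_custom.fit(docs + ['the'])  # 'the' is still filtered out
vocab_custom = sorted(cv_custom.vocabulary_)

print(vocab_all)
print(vocab_custom)
```

The second option may be preferable if you still want common words such as 'the' removed.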

Upvotes: 2
