How to tell in advance if CountVectorizer will throw ValueError: empty vocabulary?

Question

Is it possible to know in advance whether CountVectorizer will throw ValueError: empty vocabulary?

Basically, I have a corpus of documents and I'd like to filter out those that won't pass the CountVectorizer (I'm using stop_words='english').

Thanks

Venkatachalam · Accepted Answer

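You could identify those documents using build_analyzer(). It returns the callable that CountVectorizer applies to each document (preprocessing, tokenization, and stop-word removal), so a document for which it returns an empty list contributes nothing to the vocabulary. Try this!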

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'this is to',
    'she has'
]

# Same analyzer the vectorizer would use when fitting.
analyzer = CountVectorizer(stop_words='english').build_analyzer()

# A document passes only if the analyzer returns at least one token.
filter_condtn = [bool(analyzer(doc)) for doc in corpus]

print(filter_condtn)
# [True, True, False, True, False, False]

P.S.: I was surprised to see that every word in the third document is in the stop-word list.
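To actually drop the failing documents before fitting, the same check can be wrapped in a small helper. This is only a sketch: split_by_analyzer is a hypothetical name, not part of scikit-learn, and it assumes default min_df/max_df. Pruning by those parameters happens after analysis and raises a different error, which this check does not anticipate.

```python
from sklearn.feature_extraction.text import CountVectorizer


def split_by_analyzer(docs, vectorizer):
    """Hypothetical helper: separate docs that yield tokens from those that do not."""
    analyzer = vectorizer.build_analyzer()
    kept, dropped = [], []
    for doc in docs:
        # An empty token list means the document adds nothing to the vocabulary.
        (kept if analyzer(doc) else dropped).append(doc)
    return kept, dropped


corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'this is to',
    'she has',
]

vectorizer = CountVectorizer(stop_words='english')
kept, dropped = split_by_analyzer(corpus, vectorizer)

# Fit only if something survived; fitting on an empty list would itself
# raise "ValueError: empty vocabulary".
if kept:
    counts = vectorizer.fit_transform(kept)
    vocabulary = sorted(vectorizer.vocabulary_)
else:
    counts, vocabulary = None, []
```

Note that the error is raised only when the whole input yields no terms, so a single stop-word-only document mixed in with others does not trigger it; it just becomes an all-zero row.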
