Reputation: 2570
Is it possible to know in advance if CountVectorizer will throw ValueError: empty vocabulary?
Basically, I have a corpus of documents and I'd like to filter out those that won't pass the CountVectorizer (I'm using stop_words='english').
Thanks
Upvotes: 1
Views: 54
Reputation: 16966
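You could identify those documents using build_analyzer(), which returns the callable CountVectorizer uses to preprocess, tokenize, and remove stop words. A document for which it returns an empty list contributes no terms. Try this!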
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'this is to',
    'she has',
]

# The analyzer lowercases, tokenizes, and drops English stop words.
analyzer = CountVectorizer(stop_words='english').build_analyzer()

# True if at least one token survives analysis, False otherwise.
filter_condtn = [bool(analyzer(doc)) for doc in corpus]
# [True, True, False, True, False, False]
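To go from that check to an actual filtered corpus, something along these lines should work. This is only a sketch: the names kept, dropped, and vocab are my own, and it assumes the default min_df/max_df settings. As far as I know, the analyzer does not account for min_df/max_df pruning, which raises a different ValueError ("After pruning, no terms remain").

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'this is to',
    'she has',
]

vectorizer = CountVectorizer(stop_words='english')
analyzer = vectorizer.build_analyzer()

# Split the corpus by whether any token survives analysis.
kept = [doc for doc in corpus if analyzer(doc)]
dropped = [doc for doc in corpus if not analyzer(doc)]

# "empty vocabulary" is raised when the whole batch yields no tokens,
# so only fit when at least one document survived the filter.
if kept:
    X = vectorizer.fit_transform(kept)
    vocab = sorted(vectorizer.vocabulary_)
```

Note that a single empty document mixed in with non-empty ones does not trigger the error by itself; it just ends up as an all-zero row. The error shows up when nothing in the batch produces a token.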
P.S.: I was surprised to see that every word in the third document is a stop word (both 'third' and 'one' are in scikit-learn's English stop word list).
Upvotes: 1