IsaacLevon

Reputation: 2570

How to tell in advance if CountVectorizer will throw ValueError: empty vocabulary?

Is it possible to know in advance if CountVectorizer will throw ValueError: empty vocabulary?

Basically, I have a corpus of documents and I'd like to filter out those that won't make it through CountVectorizer (I'm using stop_words='english').

Thanks

Upvotes: 1

Views: 54

Answers (1)

Venkatachalam

Reputation: 16966

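You can identify those documents in advance with build_analyzer(). It returns the callable that CountVectorizer uses to preprocess, tokenize, and remove stop words, so a document for which it returns an empty list contributes no tokens. Try this: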

from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'this is to',
    'she has'
]
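# build_analyzer() returns the callable that preprocesses, tokenizes, and removes stop words
analyzer = CountVectorizer(stop_words='english').build_analyzer()
# True for documents that still have at least one token after analysis
filter_condtn = [bool(analyzer(doc)) for doc in corpus]

print(filter_condtn)
# [True, True, False, True, False, False]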
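
To actually drop those documents before fitting, something along these lines should work. It is only a sketch: the shortened corpus and the names kept_docs, counts, and vocabulary are made up for illustration. As far as I know, the empty vocabulary error is raised at fit time only when no document in the whole corpus yields a token, so the guard checks the filtered list rather than each document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Shortened corpus, for illustration only
corpus = [
    'This is the first document.',
    'And this is the third one.',
    'this is to',
]

vectorizer = CountVectorizer(stop_words='english')
analyzer = vectorizer.build_analyzer()

# Keep only the documents that still have at least one token after analysis
kept_docs = [doc for doc in corpus if analyzer(doc)]

# Guard against the case where every document was filtered out,
# which is when fit/fit_transform would raise the empty vocabulary error
if kept_docs:
    counts = vectorizer.fit_transform(kept_docs)
    vocabulary = sorted(vectorizer.vocabulary_)
else:
    counts, vocabulary = None, []

print(kept_docs)
print(vocabulary)
```

Reusing the same vectorizer for build_analyzer() and fit_transform() keeps the filter consistent with whatever settings (stop words, token pattern, n-grams) the real fit will use.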

P.S.: I was surprised to find that every word in the third document is a stop word.

Upvotes: 1
