Reputation: 36724
I want to have a custom CountVectorizer vocabulary to note the presence or absence of an expression. Rather than words, I want it to detect combinations of words.
Based on my custom vocabulary, I would like sklearn to detect "big dog".
from sklearn.feature_extraction.text import CountVectorizer
cvec = CountVectorizer(vocabulary=['big dog', 'cat'])
cvec.fit_transform(['The big dog and the cat']).toarray()
array([[0, 1]], dtype=int64)
It doesn't seem to detect "big dog", which is the combination of words I'm looking for. Is there a way to do this, or can this function only detect single words?
Upvotes: 2
Views: 454
Reputation: 3851
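You should set ngram_range to something wider than the default (1, 1), for example (1, 2), if you want sklearn to consider combinations of 2 words. With the default, the text is split into single words only, so a two-word vocabulary entry such as "big dog" can never match.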
from sklearn.feature_extraction.text import CountVectorizer
cvec = CountVectorizer(vocabulary=['big dog', 'cat'], ngram_range=(1, 2))
cvec.fit_transform(['The big dog and the cat']).toarray()
array([[1, 1]], dtype=int64)
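To see why ngram_range matters, it can help to look at the tokens the analyzer actually produces. The sketch below is only an illustration: the sample sentence is made up, and binary=True is added on the assumption that you want presence/absence rather than counts, as the question suggests.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample text in which "big dog" occurs twice.
doc = "The big dog and the big dog and the cat"
vocab = ["big dog", "cat"]

# Default analyzer: lowercased single words only, so "big dog" is never produced.
unigram_tokens = CountVectorizer().build_analyzer()(doc)

# With ngram_range=(1, 2) the analyzer emits single words and word pairs.
bigram_tokens = CountVectorizer(ngram_range=(1, 2)).build_analyzer()(doc)

print("big dog" in unigram_tokens)  # False
print("big dog" in bigram_tokens)   # True

# binary=True caps every non-zero count at 1 (presence/absence).
cvec = CountVectorizer(vocabulary=vocab, ngram_range=(1, 2), binary=True)
presence = cvec.fit_transform([doc]).toarray()
print(presence)  # [[1 1]]
```

Without binary=True, the "big dog" column would hold 2 for this sentence, since the expression occurs twice.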
Upvotes: 2