Using custom vocabulary n-grams for sklearn CountVectorizer

Question

I want to have a custom CountVectorizer vocabulary to note the presence or absence of an expression. Rather than words, I want it to detect combinations of words.

Based on my custom vocabulary, I would like sklearn to detect "big dog".

from sklearn.feature_extraction.text import CountVectorizer

cvec = CountVectorizer(vocabulary=['big dog', 'cat'])

cvec.fit_transform(['The big dog and the cat']).toarray()

array([[0, 1]], dtype=int64)

It doesn't seem to detect "big dog", which is the combination of words I'm looking for. Is there a way to do this, or can CountVectorizer only detect single words?

Andrey Lukyanenko · Accepted Answer

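By default CountVectorizer uses ngram_range=(1, 1), so it only extracts single words and a two-word entry like "big dog" can never match. Set ngram_range to something wider than (1, 1), for example (1, 2), if you want sklearn to also consider combinations of 2 words.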

from sklearn.feature_extraction.text import CountVectorizer

cvec = CountVectorizer(vocabulary=['big dog', 'cat'], ngram_range=(1, 2))

cvec.fit_transform(['The big dog and the cat']).toarray()

array([[1, 1]], dtype=int64)
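Since the question asks about presence or absence rather than counts, here is a minimal sketch of one way to extend this. It assumes vocabulary entries are whitespace-separated words, derives the upper n-gram bound from the longest entry, and passes binary=True so repeated matches are reported as 1. The names vocab, max_n and docs are just illustrative, and the second and third documents are made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

vocab = ['big dog', 'cat']

# Assumption: entries are whitespace-separated words, so the longest
# entry tells us how large the n-grams need to be.
max_n = max(len(term.split()) for term in vocab)

# binary=True records presence (1) or absence (0) instead of raw counts.
cvec = CountVectorizer(vocabulary=vocab, ngram_range=(1, max_n), binary=True)

docs = [
    'The big dog and the cat',   # contains both expressions
    'The cat and another cat',   # "cat" twice, still reported as 1
    'A big, friendly dog',       # "big" and "dog" are not adjacent
]

result = cvec.fit_transform(docs).toarray().tolist()
print(result)  # [[1, 1], [0, 1], [0, 0]]
```

Columns follow the order of the vocabulary list, so the first column is "big dog" and the second is "cat". Note that the n-grams are built from adjacent tokens after the default tokenization and lowercasing, which is why the third document does not match "big dog".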
