cgclip

Reputation: 312

Setting sklearn's CountVectorizer vocabulary to a list of phrases

Hello, I've been experimenting with text analysis in scikit-learn, and I had the idea of using CountVectorizer to detect whether a document contains a set of keywords and phrases.

I know that we can do this:

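import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

words = ['cat', 'dog', 'walking']
example = ['I was walking my dog and cat in the park']
vect = CountVectorizer(vocabulary=words)
dtm = vect.fit_transform(example)
# get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())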

...

   cat  dog  walking
    1    1        1

I'm wondering whether it's possible to tweak this so that I can use multi-word phrases instead of just individual words.

Adapting the example above, this is the result I would like to get:

phrases = ['cat in the park', 'walking my dog']
example = ['I was walking my dog and cat in the park']
vect = CountVectorizer(vocabulary=phrases)
dtm = vect.fit_transform(example)
pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())
... 

       cat in the park   walking my dog
            1                   1

Right now, the code using the phrases just outputs:

cat in the park   walking my dog
     0                   0

Thank you in advance!

Upvotes: 3

Views: 3132

Answers (1)

MaxU - stand with Ukraine

Reputation: 210852

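By default CountVectorizer uses ngram_range=(1, 1), so it only extracts single words and multi-word vocabulary entries never match. Set ngram_range to span the shortest and longest phrase in your vocabulary: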

In [104]: lens = [len(x.split()) for x in phrases]

In [105]: mn, mx = min(lens), max(lens)

In [106]: vect = CountVectorizer(vocabulary=phrases, ngram_range=(mn, mx))

In [107]: dtm = vect.fit_transform(example)

In [108]: pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())
Out[108]:
   cat in the park  walking my dog
0                1               1

In [109]: print(mn, mx)
3 4
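
To check what the vectorizer is actually matching against, you can look at the n-grams its analyzer produces. The following is a minimal, self-contained sketch that reuses the `phrases` and `example` values from the question; the variable names `unigrams`, `ngrams` and `counts` are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

phrases = ['cat in the park', 'walking my dog']
example = ['I was walking my dog and cat in the park']

# With the default ngram_range=(1, 1) the analyzer emits single tokens only,
# so a multi-word vocabulary entry can never appear among them.
unigrams = CountVectorizer().build_analyzer()(example[0])

# Widen the range to cover the shortest and longest phrase (by word count).
lens = [len(p.split()) for p in phrases]
vect = CountVectorizer(vocabulary=phrases, ngram_range=(min(lens), max(lens)))
ngrams = vect.build_analyzer()(example[0])

# Counts are returned in the same column order as the phrases list.
counts = vect.fit_transform(example).toarray()

print(unigrams)
print('walking my dog' in ngrams, 'cat in the park' in ngrams)
print(counts)
```

Two things to keep in mind with this approach: the default analyzer lowercases the text, so the phrases in the vocabulary should be lowercase too, and the default token_pattern drops single-character tokens such as "a" or "I", so a phrase containing one will not match unless you change token_pattern.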

Upvotes: 4
