Reputation: 312
Hello, I've been playing around with text analysis in scikit-learn, and I had the idea of using CountVectorizer to detect whether a document contains a set of keywords and phrases.
I know that we can do this:
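import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

words = ['cat', 'dog', 'walking']
example = ['I was walking my dog and cat in the park']
vect = CountVectorizer(vocabulary=words)
dtm = vect.fit_transform(example)
# scikit-learn >= 1.0; older releases use vect.get_feature_names() instead
>>> pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names_out())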
...
cat dog walking
1 1 1
I'm wondering if it's possible to tweak things so that I can use multi-word phrases instead of just individual words.
Adapting the example above, this is what I'd like to get:
phrases = ['cat in the park', 'walking my dog']
example = ['I was walking my dog and cat in the park']
vect = CountVectorizer(vocabulary=phrases)
dtm = vect.fit_transform(example)
>>> pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names_out())  # get_feature_names() before scikit-learn 1.0
...
cat in the park  walking my dog
              1               1
Right now, the code using the phrases just outputs:
cat in the park  walking my dog
              0               0
Thank you in advance!
Upvotes: 3
Views: 3132
Reputation: 210852
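By default, CountVectorizer uses ngram_range=(1, 1), so it only extracts single words and the multi-word entries in your vocabulary never match. Set ngram_range to span the shortest and longest phrase lengths: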
In [104]: lens = [len(x.split()) for x in phrases]
In [105]: mn, mx = min(lens), max(lens)
In [106]: vect = CountVectorizer(vocabulary=phrases, ngram_range=(mn, mx))
In [107]: dtm = vect.fit_transform(example)
In [108]: pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names_out())  # get_feature_names() before scikit-learn 1.0
Out[108]:
   cat in the park  walking my dog
0                1               1
In [109]: print(mn, mx)
3 4
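To see why the n-gram range matters, here is a rough pure-Python sketch of the matching step. It is a simplified stand-in, not scikit-learn's actual code: the helpers word_ngrams and count_phrases are made-up names for illustration, and the regex only approximates the default tokenizer (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

# Assumption: approximates CountVectorizer's default tokenization
# (lowercase, keep tokens of two or more word characters).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def word_ngrams(text, ngram_range):
    """Return all space-joined word n-grams with min_n <= n <= max_n."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def count_phrases(text, vocabulary, ngram_range=(1, 1)):
    """Count how often each vocabulary entry appears among the n-grams."""
    counts = Counter(word_ngrams(text, ngram_range))
    return {phrase: counts[phrase] for phrase in vocabulary}


phrases = ['cat in the park', 'walking my dog']
example = 'I was walking my dog and cat in the park'

# Unigrams only: no single word equals a whole phrase, so both counts are 0.
print(count_phrases(example, phrases))
# 3- and 4-grams: both phrases now show up among the candidates.
print(count_phrases(example, phrases, ngram_range=(3, 4)))
```

With the default range, only single words are compared against the vocabulary, which is why the question's output was all zeros. Widening the range to (3, 4) produces candidates of the same length as the phrases, so each one is found once.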
Upvotes: 4