mgcdanny

Reputation: 1202

Given a dictionary of word and frequency pairs, how do I proceed with text mining in scikit-learn?

I already have word frequencies and categories like this:

y = ['animals', 'restaurants', 'sports']
x = [{'cat':1, 'dog':2}, {'food':4, 'drink':2}, {'baseball':4, 'basketball':5}]

How should I build a pipeline like the one in the tutorial, shown here:

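>>> from sklearn.pipeline import Pipeline
>>> from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
>>> from sklearn.naive_bayes import MultinomialNB
>>> text_clf = Pipeline([('vect', CountVectorizer()),
...                      ('tfidf', TfidfTransformer()),
...                      ('clf', MultinomialNB()),
... ])

>>> text_clf = text_clf.fit(twenty_train.data, twenty_train.target)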

CountVectorizer expects raw strings, not dictionaries. I guess I could build a string from each dictionary by repeating each word as many times as it occurs?
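For reference, the string-repeating workaround could be sketched roughly as follows. This is only an illustration of the idea, not a recommendation; the helper name counts_to_text is made up for this example, and it assumes every key is a single token that CountVectorizer's default tokenizer keeps (lowercase, at least two characters).

```python
from sklearn.feature_extraction.text import CountVectorizer

x = [{'cat': 1, 'dog': 2}, {'food': 4, 'drink': 2}, {'baseball': 4, 'basketball': 5}]

def counts_to_text(counts):
    # Hypothetical helper: repeat each word as many times as its count.
    return ' '.join(' '.join([word] * n) for word, n in counts.items())

docs = [counts_to_text(d) for d in x]

# CountVectorizer re-tokenizes the strings and recovers the same counts.
vect = CountVectorizer()
X = vect.fit_transform(docs)

print(docs[0])
print(sorted(vect.vocabulary_))
print(X.toarray())
```

This round trip rebuilds counts that are already known, so it does extra work and would break for keys that the tokenizer splits or drops.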

http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

Upvotes: 1

Views: 594

Answers (1)

elyase

Reputation: 41003

If you already have word frequencies, use a DictVectorizer, which maps each dictionary of counts directly to a row of the feature matrix:

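from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# DictVectorizer replaces CountVectorizer because the counts already exist
pipeline = Pipeline([('dvect', DictVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
model = pipeline.fit(x, y)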

Then you can do:

>>> model.predict([{'cat':1}])[0]
'animals'
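To see what the DictVectorizer step feeds into the rest of the pipeline, it can help to run it on its own. The following is a minimal sketch using the data from the question; sparse=False is chosen here only so the matrix prints readably, and the key 'unicorn' is a made-up example of a word not seen during fitting.

```python
from sklearn.feature_extraction import DictVectorizer

x = [{'cat': 1, 'dog': 2}, {'food': 4, 'drink': 2}, {'baseball': 4, 'basketball': 5}]

# sparse=False returns a dense NumPy array, which is easier to inspect.
dvect = DictVectorizer(sparse=False)
X = dvect.fit_transform(x)

# One column per distinct key, sorted alphabetically by default.
print(dvect.feature_names_)
print(X)

# Keys that were not seen during fit are silently ignored at transform time.
unseen = dvect.transform([{'cat': 1, 'unicorn': 3}])
print(unseen)
```

Each row holds the counts for one dictionary, so the result has the same shape that CountVectorizer would have produced from the raw text.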

Upvotes: 1
