Reputation: 1202
I already have word frequencies and categories like this:
y = ['animals', 'restaurants', 'sports']
x = [{'cat':1, 'dog':2}, {'food':4, 'drink':2}, {'baseball':4, 'basketball':5}]
How should I build the pipeline from the tutorial, shown below, with this data?
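>>> from sklearn.pipeline import Pipeline
>>> from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
>>> from sklearn.naive_bayes import MultinomialNB
>>> text_clf = Pipeline([('vect', CountVectorizer()),
...                      ('tfidf', TfidfTransformer()),
...                      ('clf', MultinomialNB()),
... ])
>>> text_clf = text_clf.fit(twenty_train.data, twenty_train.target)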
CountVectorizer expects raw strings, not dictionaries of counts. I guess I could build a string from each dictionary by repeating every word as many times as it occurs?
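For reference, a minimal sketch of that workaround, assuming every key is a single token that CountVectorizer's default tokenizer keeps (lowercase, at least two characters). The helper name counts_to_text is made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

x = [{'cat': 1, 'dog': 2}, {'food': 4, 'drink': 2}, {'baseball': 4, 'basketball': 5}]

def counts_to_text(counts):
    # Hypothetical helper: repeat each word as many times as its count.
    return ' '.join(' '.join([word] * n) for word, n in counts.items())

docs = [counts_to_text(d) for d in x]

# CountVectorizer re-tokenizes the strings and recovers the original counts.
vect = CountVectorizer()
X = vect.fit_transform(docs)

print(docs[0])
print(X.toarray()[0])
```

This round-trips the counts through text only so that they can be counted again, which is wasteful for large documents.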
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
Upvotes: 1
Reputation: 41003
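If you already have word frequencies, use a DictVectorizer in place of CountVectorizer:
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
# DictVectorizer turns each {word: count} dict into one row of a sparse count matrix
pipeline = Pipeline([('dvect', DictVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
model = pipeline.fit(x, y)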
Then you can do:
>>> model.predict([{'cat':1}])[0]
'animals'
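To see what the first pipeline step hands to the later ones, here is a small sketch that runs DictVectorizer on its own, using the x from the question. The 'unicorn' key is invented purely to show how a word not seen during fit is handled, and sparse=False is set only to make the matrix easy to print:

```python
from sklearn.feature_extraction import DictVectorizer

x = [{'cat': 1, 'dog': 2}, {'food': 4, 'drink': 2}, {'baseball': 4, 'basketball': 5}]

# sparse=False returns a dense NumPy array instead of a SciPy sparse matrix.
dvect = DictVectorizer(sparse=False)
X = dvect.fit_transform(x)

# Columns follow the feature names in sorted order.
print(dvect.vocabulary_)
print(X)

# Keys that were not seen during fit are silently ignored at transform time.
unseen = dvect.transform([{'cat': 1, 'unicorn': 3}])
print(unseen)
```

Each dictionary becomes one row of counts, the same kind of matrix CountVectorizer would produce from raw text, so TfidfTransformer and MultinomialNB can consume it unchanged.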
Upvotes: 1