Reputation: 7879
I'm building an SGDClassifier and using a tf-idf transformer. Besides the features created by tf-idf, I'd also like to add features such as document length or other ratings. How can I add these to the feature set? Here is how the classifier is constructed in a pipeline:
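    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    data = fetch_20newsgroups(subset='train', categories=None)
    pipeline = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', SGDClassifier()),
    ])
    parameters = {
        'vect__max_df': (0.5, 0.75, 1.0),
        'vect__max_features': (None, 5000, 10000, 50000),
        'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
        'tfidf__use_idf': (True, False),
    }
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
    grid_search.fit(data.data, data.target)
    print(grid_search.best_score_)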
Upvotes: 3
Views: 5177
Reputation: 1345
You can use a FeatureUnion: http://scikit-learn.org/stable/modules/pipeline.html#featureunion-composite-feature-spaces
There is a nice example in the documentation, https://scikit-learn.org/0.18/auto_examples/hetero_feature_union.html, which I think fits your requirements exactly. See the TextStats transformer.
[Update: the example was for scikit-learn <= 0.18]
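For the document-length case specifically, here is a minimal sketch of how a FeatureUnion might be wired into the pipeline from the question. The DocLength class, the step names ("features", "text", "length", "len", "scale") and the tiny in-memory corpus are my own assumptions for illustration; they are not taken from the linked example, and the scaling step is just one reasonable choice since SGDClassifier tends to be sensitive to feature scale.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler


class DocLength(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one column with each document's character count."""

    def fit(self, X, y=None):
        # Nothing to learn; present so the class works inside a Pipeline.
        return self

    def transform(self, X):
        return np.array([[len(doc)] for doc in X], dtype=float)


pipeline = Pipeline([
    ("features", FeatureUnion([
        # Branch 1: the same count + tf-idf steps as in the question.
        ("text", Pipeline([
            ("vect", CountVectorizer()),
            ("tfidf", TfidfTransformer()),
        ])),
        # Branch 2: the extra feature, scaled (an assumption, see above).
        ("length", Pipeline([
            ("len", DocLength()),
            ("scale", StandardScaler()),
        ])),
    ])),
    ("clf", SGDClassifier(random_state=0)),
])

# Toy stand-in for data.data / data.target, so the sketch runs offline.
docs = [
    "short note about hockey",
    "a much longer message discussing graphics cards and rendering in detail",
    "hockey playoffs tonight",
    "rendering pipelines and graphics drivers are discussed at length here",
]
labels = [0, 1, 0, 1]

pipeline.fit(docs, labels)
features = pipeline.named_steps["features"].transform(docs)
predictions = pipeline.predict(docs)
```

With this nesting, the grid-search keys from the question get a longer prefix, e.g. 'features__text__vect__max_df' instead of 'vect__max_df'.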
Regards,
Upvotes: 6