Add Features to An Sklearn Classifier

Question

I'm building an SGDClassifier and using a TF-IDF transformer. In addition to the features created by TF-IDF, I'd like to add further features such as document length or other ratings. How can I add these to the feature set? Here is how the classifier is constructed in a pipeline:

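from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

data = fetch_20newsgroups(subset='train', categories=None)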
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'tfidf__use_idf': (True, False),
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
grid_search.fit(data.data, data.target)
print(grid_search.best_score_)

Ale · Accepted Answer

You can use a FeatureUnion: http://scikit-learn.org/stable/modules/pipeline.html#featureunion-composite-feature-spaces

There is a nice example in the documentation (https://scikit-learn.org/0.18/auto_examples/hetero_feature_union.html) which I think fits your requirements exactly. See the TextStats transformer in that example.

[Update: the example was written for scikit-learn <= 0.18]
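
For illustration, here is a minimal sketch of how a FeatureUnion could combine the TF-IDF features with a hand-made length feature. It is not taken from the linked example: DocumentLength is a hypothetical transformer written for this sketch (it is not part of scikit-learn), the step names 'text', 'stats', 'length' and 'scale' are arbitrary, and the four toy documents are invented so the snippet runs without downloading the 20 newsgroups data.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler


class DocumentLength(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: character count and word count per document."""

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        # One row per document, two dense columns.
        return np.array([[len(doc), len(doc.split())] for doc in X], dtype=float)


# Each branch receives the same raw documents; the outputs are stacked column-wise.
features = FeatureUnion([
    ('text', Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
    ])),
    ('stats', Pipeline([
        ('length', DocumentLength()),
        ('scale', StandardScaler()),  # keep the counts on a scale comparable to TF-IDF
    ])),
])

pipeline = Pipeline([
    ('features', features),
    ('clf', SGDClassifier(random_state=0)),
])

# Toy data, invented for this sketch only.
docs = [
    "the rocket launch was delayed",
    "nasa plans a new mission to mars next year",
    "the team won the hockey game",
    "a late goal decided the final match of the season",
]
labels = [0, 0, 1, 1]

pipeline.fit(docs, labels)

# Combined feature matrix: TF-IDF columns followed by the two length columns.
combined = pipeline.named_steps['features'].transform(docs)
print(combined.shape)
```

With this layout, the grid-search keys from the question need the extra prefixes of the nested steps, for example 'features__text__vect__max_df' instead of 'vect__max_df'.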

