Reputation: 5477
I want to use a VotingClassifier inside a sklearn Pipeline, where I define a set of classifiers.
I got some intuition from the question "Using VotingClassifier in Sklearn Pipeline" to build the code below, but in that question each classifier is defined in its own independent Pipeline. I don't want to do it that way: my features are prepared in a shared step, and generating them is time-consuming, so repeating the feature generation in multiple Pipelines with different classifiers is not a good idea.
How can I achieve that?
model = Pipeline([
    ('feat', FeatureUnion([
        ('tfidf', TfidfVectorizer(analyzer='char', ngram_range=(3, 5), min_df=0.01, lowercase=True, tokenizer=tokenizeTfidf)),
    ])),
    ('pip1', Pipeline([('clf1', GradientBoostingClassifier(n_estimators=1000, random_state=7))])),
    ('pip2', Pipeline([('clf2', SVC())])),
    ('pip3', Pipeline([('clf3', RandomForestClassifier())])),
    ('clf', VotingClassifier(estimators=["pip1", "pip2", "pip3"]))
])
clf = model.fit(X_train, y_train)
But I got this error:
('clf', VotingClassifier(estimators=["pip1", "pip2", "pip3"])),
File "C:\Python35\lib\site-packages\imblearn\pipeline.py", line 115, in __init__
self._validate_steps()
File "C:\Python35\lib\site-packages\imblearn\pipeline.py", line 139, in _validate_steps
"(but not both) '%s' (type %s) doesn't)" % (t, type(t)))
TypeError: All intermediate steps of the chain should be estimators that implement fit and transform or sample (but not both) 'Pipeline(memory=None,
steps=[('clf1', GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.1, loss='deviance', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=1000,
presort='auto', random_state=7, subsample=1.0, verbose=0,
warm_start=False))])' (type <class 'imblearn.pipeline.Pipeline'>) doesn't)
Upvotes: 3
Views: 1390
Reputation: 36599
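I am assuming you want to do something like this:
1) Transform the text data to tf-idf features using TfidfVectorizer.
2) Send the transformed data to the three estimators (GradientBoostingClassifier, SVC, RandomForestClassifier) and then use voting to get the predictions.
Your version fails because every intermediate step of a Pipeline must implement transform, and classifiers do not. VotingClassifier also expects a list of (name, estimator) tuples, not the names of other pipeline steps. If the above is what you want, pass the classifiers directly to the VotingClassifier and make it the final step: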
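from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.feature_extraction.text import TfidfVectorizer
# The traceback in the question shows imblearn's Pipeline; sklearn's is assumed here.
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

# tokenizeTfidf is the asker's own tokenizer function, defined elsewhere.
model = Pipeline([
    # The features are computed once here and shared by all three classifiers.
    ('feat', FeatureUnion([
        ('tfidf', TfidfVectorizer(analyzer='char',
                                  ngram_range=(3, 5),
                                  min_df=0.01,
                                  lowercase=True,
                                  tokenizer=tokenizeTfidf)),
    ])),
    # The final step votes over the three classifiers.
    ('clf', VotingClassifier(estimators=[
        ("pip1", GradientBoostingClassifier(n_estimators=1000, random_state=7)),
        ("pip2", SVC()),
        ("pip3", RandomForestClassifier()),
    ])),
])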
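To see why the original attempt raised the TypeError, it may help to check which objects could serve as intermediate pipeline steps at all. The sketch below uses a hypothetical helper, `can_be_intermediate_step`, which is only a simplified approximation of the validation that Pipeline performs (it is not the library's own code). It assumes a reasonably recent scikit-learn release, where classifiers do not expose a `transform` method.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC


def can_be_intermediate_step(estimator):
    """Hypothetical helper: rough approximation of Pipeline's step check.

    Intermediate steps need both fit and transform; only the last step
    may be a plain classifier.
    """
    return hasattr(estimator, "fit") and hasattr(estimator, "transform")


# Map each class name to whether it could sit in the middle of a Pipeline.
checks = {
    type(est).__name__: can_be_intermediate_step(est)
    for est in (TfidfVectorizer(), GradientBoostingClassifier(),
                SVC(), RandomForestClassifier())
}

for name, ok in checks.items():
    print(name, "can be an intermediate step:", ok)
```

Only the vectorizer passes, which is why the classifiers have to live inside the final VotingClassifier step rather than in 'pip1', 'pip2' and 'pip3' steps of their own.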
Also, if you are only using a single TfidfVectorizer and not combining any other features with it, you don't even need the FeatureUnion:
model = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='char',
                              ngram_range=(3, 5),
                              min_df=0.01,
                              lowercase=True,
                              tokenizer=tokenizeTfidf)),
    ('clf', VotingClassifier(estimators=[
        ("pip1", GradientBoostingClassifier(n_estimators=1000, random_state=7)),
        ("pip2", SVC()),
        ("pip3", RandomForestClassifier()),
    ])),
])
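For anyone who wants to try this end to end, here is a minimal runnable sketch of the simplified pipeline. The toy sentences and labels are made up purely for illustration, the custom tokenizer and min_df are left out because the asker's tokenizeTfidf is not shown, the estimator names ("gb", "svc", "rf") are arbitrary, and n_estimators is reduced so it runs quickly. It also assumes sklearn's own Pipeline instead of the imblearn one seen in the traceback.

```python
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy data, invented for this example only.
X_train = ["good movie", "great film", "nice acting", "loved this story",
           "bad movie", "awful film", "poor acting", "hated this story"]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]

model = Pipeline([
    # Character n-gram tf-idf features, computed once for all classifiers.
    ('tfidf', TfidfVectorizer(analyzer='char', ngram_range=(3, 5),
                              lowercase=True)),
    # Default (hard) voting: each classifier predicts, the majority wins.
    ('clf', VotingClassifier(estimators=[
        ("gb", GradientBoostingClassifier(n_estimators=10, random_state=7)),
        ("svc", SVC()),
        ("rf", RandomForestClassifier(n_estimators=10, random_state=7)),
    ])),
])

model.fit(X_train, y_train)
predictions = model.predict(["great story", "awful acting"])
print(predictions)
```

After fitting, the three trained classifiers are available as model.named_steps['clf'].estimators_, and the vectorizer is fitted only once, which avoids the repeated feature generation the question was worried about.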
Upvotes: 10