Reputation: 103
This is my first question here, so please let me know if I am doing something wrong!
So I used sklearn to build an ensemble VotingClassifier that contains 3 different estimators.
I first fit all 3 with the same data by calling ens.fit().
This first dataset is small, because fitting 2 of the 3 estimators is very time-consuming.
Now I want to fit the third estimator again with different data. Is there a way to achieve this?
I tried accessing the estimator like this:
ens.estimators_[2].fit(X_largedata, y_largedata)
This does not throw an error, but I am not sure whether this fits a copy of the estimator or the one that is actually part of the ensemble.
Calling ens.predict(X_test)
afterwards now results in the following error (predict works fine if I don't try to fit the 3rd estimator):
ValueError Traceback (most recent call last)
<ipython-input-438-65c955f40b01> in <module>
----> 1 pred_ens2 = ens.predict(X_test_ens2)
2 print(ens.score(X_test_ens2, y_test_ens2))
3 confusion_matrix(pred_ens2, y_test_ens2).ravel()
~/jupyter/lexical/lexical_env/lib/python3.7/site-packages/sklearn/ensemble/_voting.py in predict(self, X)
280 check_is_fitted(self)
281 if self.voting == 'soft':
--> 282 maj = np.argmax(self.predict_proba(X), axis=1)
283
284 else: # 'hard' voting
~/jupyter/lexical/lexical_env/lib/python3.7/site-packages/sklearn/ensemble/_voting.py in _predict_proba(self, X)
300 """Predict class probabilities for X in 'soft' voting."""
301 check_is_fitted(self)
--> 302 avg = np.average(self._collect_probas(X), axis=0,
303 weights=self._weights_not_none)
304 return avg
~/jupyter/lexical/lexical_env/lib/python3.7/site-packages/sklearn/ensemble/_voting.py in _collect_probas(self, X)
295 def _collect_probas(self, X):
296 """Collect results from clf.predict calls."""
--> 297 return np.asarray([clf.predict_proba(X) for clf in self.estimators_])
298
299 def _predict_proba(self, X):
~/jupyter/lexical/lexical_env/lib/python3.7/site-packages/sklearn/ensemble/_voting.py in <listcomp>(.0)
295 def _collect_probas(self, X):
296 """Collect results from clf.predict calls."""
--> 297 return np.asarray([clf.predict_proba(X) for clf in self.estimators_])
298
299 def _predict_proba(self, X):
~/jupyter/lexical/lexical_env/lib/python3.7/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
117
118 # lambda, but not partial, allows help() to work with update_wrapper
--> 119 out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
120 # update the docstring of the returned function
121 update_wrapper(out, self.fn)
~/jupyter/lexical/lexical_env/lib/python3.7/site-packages/sklearn/pipeline.py in predict_proba(self, X)
461 Xt = X
462 for _, name, transform in self._iter(with_final=False):
--> 463 Xt = transform.transform(Xt)
464 return self.steps[-1][-1].predict_proba(Xt)
465
~/jupyter/lexical/lexical_env/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in transform(self, X)
596 if (n_cols_transform >= n_cols_fit and
597 any(X.columns[:n_cols_fit] != self._df_columns)):
--> 598 raise ValueError('Column ordering must be equal for fit '
599 'and for transform when using the '
600 'remainder keyword')
ValueError: Column ordering must be equal for fit and for transform when using the remainder keyword
EDIT: I fixed the error! It was caused by the small dataset having more columns than the big one. This is presumably a problem because, when fitting the first time with the small dataset, the transformers record which columns to expect. Once both datasets had the same columns (and column order), it worked. So this does seem to be the right way to train only one specific estimator, but please let me know if there is a better way or if you think I am wrong.
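For reference, this is roughly what the fix looks like (just a sketch; cols, X_small and y_small are placeholders for my actual column list and data):
# The column transformers remember the column layout they were fitted on,
# so every dataset has to have the same columns in the same order.
X_small     = X_small[cols]      # used for the initial ens.fit()
X_largedata = X_largedata[cols]  # used to refit the third estimator
X_test_ens2 = X_test_ens2[cols]

ens.fit(X_small, y_small)                         # fit all three estimators
ens.estimators_[2].fit(X_largedata, y_largedata)  # refit only the third one
pred_ens2 = ens.predict(X_test_ens2)              # predict works again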
Upvotes: 2
Views: 698
Reputation: 9481
So, it seems that the individual classifiers are stored in a list that can be accessed with .estimators_. The individual entries of this list are classifiers that have the .fit method. So, an example with logistic regression:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
X1, y1 = make_classification(random_state=1)
X2, y2 = make_classification(random_state=2)
clf1 = LogisticRegression(random_state=1)
clf2 = LogisticRegression(random_state=2)
clf3 = LogisticRegression(random_state=3)
voting = VotingClassifier(estimators=[
('a', clf1),
('b', clf2),
('c', clf3),
])
# fit all
voting = voting.fit(X1,y1)
# fit individual one
voting.estimators_[-1].fit(X2,y2)
voting.predict(X2)
estimators and estimators_
The estimators attribute is a list of tuples, with the form (name, estimator):
for e in voting.estimators:
print(e)
('a', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=1, solver='warn', tol=0.0001, verbose=0,
warm_start=False))
('b', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=2, solver='warn', tol=0.0001, verbose=0,
warm_start=False))
('c', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=3, solver='warn', tol=0.0001, verbose=0,
warm_start=False))
The estimators_ attribute is just a list of the estimators, without the names:
for e in voting.estimators_:
print(e)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=1, solver='warn', tol=0.0001, verbose=0,
warm_start=False)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=2, solver='warn', tol=0.0001, verbose=0,
warm_start=False)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=3, solver='warn', tol=0.0001, verbose=0,
warm_start=False)
Note, though, that
voting.estimators[0][1] == voting.estimators_[0]
evaluates to False, so the entries are not the same objects: estimators_ holds the fitted clones rather than the estimators that were passed in.
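A quick way to see the difference (continuing the example above): the estimators list still holds the original objects passed to the constructor, while estimators_ holds the clones that fit() actually trained:
# .estimators keeps the original objects passed to VotingClassifier(...),
# .estimators_ keeps the fitted clones created by voting.fit().
print(voting.estimators[0][1] is clf1)             # True  -> the original object
print(hasattr(voting.estimators[0][1], 'coef_'))   # False -> never fitted itself
print(hasattr(voting.estimators_[0], 'coef_'))     # True  -> the fitted clone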
The predict method of the voting classifier uses the .estimators_ list; check lines 295 - 323 of the source.
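For soft voting, that boils down to roughly this (a simplified sketch of the _collect_probas / _predict_proba code shown in the traceback above, with the weights left out):
import numpy as np
# Average the class probabilities of every fitted estimator and take the
# class with the highest averaged probability.
probas = np.asarray([clf.predict_proba(X2) for clf in voting.estimators_])
avg = np.average(probas, axis=0)
manual_pred = voting.classes_[np.argmax(avg, axis=1)]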
Upvotes: 1