PSC

Reputation: 103

Is it possible to fit one specific estimator out of an ensemble VotingClassifier?

This is my first Question here, please let me know if I am doing something wrong!

So I used sklearn to build an ensemble VotingClassifier that contains 3 different estimators. I first fit all 3 with the same data by calling ens.fit().
This first dataset is small because fitting 2 of the 3 estimators is very time-consuming.

Now I want to fit the third estimator again with different data. Is there a way to achieve this?

I tried accessing the estimator like this: ens.estimators_[2].fit(X_largedata, y_largedata)
This does not throw an error, but I am not sure if this is fitting a copy of the estimator or the one that's actually part of the ensemble.
Calling ens.predict(X_test) afterwards results in the following error (predict works fine if I don't try to fit the 3rd estimator):

ValueError                                Traceback (most recent call last)
<ipython-input-438-65c955f40b01> in <module>
----> 1 pred_ens2 = ens.predict(X_test_ens2)
      2 print(ens.score(X_test_ens2, y_test_ens2))
      3 confusion_matrix(pred_ens2, y_test_ens2).ravel()

~/jupyter/lexical/lexical_env/lib/python3.7/site-packages/sklearn/ensemble/_voting.py in predict(self, X)
    280         check_is_fitted(self)
    281         if self.voting == 'soft':
--> 282             maj = np.argmax(self.predict_proba(X), axis=1)
    283 
    284         else:  # 'hard' voting

~/jupyter/lexical/lexical_env/lib/python3.7/site-packages/sklearn/ensemble/_voting.py in _predict_proba(self, X)
    300         """Predict class probabilities for X in 'soft' voting."""
    301         check_is_fitted(self)
--> 302         avg = np.average(self._collect_probas(X), axis=0,
    303                          weights=self._weights_not_none)
    304         return avg

~/jupyter/lexical/lexical_env/lib/python3.7/site-packages/sklearn/ensemble/_voting.py in _collect_probas(self, X)
    295     def _collect_probas(self, X):
    296         """Collect results from clf.predict calls."""
--> 297         return np.asarray([clf.predict_proba(X) for clf in self.estimators_])
    298 
    299     def _predict_proba(self, X):

~/jupyter/lexical/lexical_env/lib/python3.7/site-packages/sklearn/ensemble/_voting.py in <listcomp>(.0)
    295     def _collect_probas(self, X):
    296         """Collect results from clf.predict calls."""
--> 297         return np.asarray([clf.predict_proba(X) for clf in self.estimators_])
    298 
    299     def _predict_proba(self, X):

~/jupyter/lexical/lexical_env/lib/python3.7/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
    117 
    118         # lambda, but not partial, allows help() to work with update_wrapper
--> 119         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    120         # update the docstring of the returned function
    121         update_wrapper(out, self.fn)

~/jupyter/lexical/lexical_env/lib/python3.7/site-packages/sklearn/pipeline.py in predict_proba(self, X)
    461         Xt = X
    462         for _, name, transform in self._iter(with_final=False):
--> 463             Xt = transform.transform(Xt)
    464         return self.steps[-1][-1].predict_proba(Xt)
    465 

~/jupyter/lexical/lexical_env/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in transform(self, X)
    596             if (n_cols_transform >= n_cols_fit and
    597                     any(X.columns[:n_cols_fit] != self._df_columns)):
--> 598                 raise ValueError('Column ordering must be equal for fit '
    599                                  'and for transform when using the '
    600                                  'remainder keyword')

ValueError: Column ordering must be equal for fit and for transform when using the remainder keyword


EDIT: I fixed the error! It was caused by the small dataset having more columns than the big one. This is probably a problem because, when fitting the first time with the small dataset, the transformers are presumably told which columns to expect. Once both datasets had the same columns (and column order), it worked. It seems this is the right way to train only one specific estimator, but please let me know if there is a better way or if you think I am wrong.
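For reference, a minimal sketch of the fix, assuming pandas DataFrames and a hypothetical X_smalldata holding exactly the columns the transformers saw at the first fit (X_largedata must contain at least those columns):

# columns (and order) the ColumnTransformer saw when the ensemble was first fitted
fit_columns = list(X_smalldata.columns)

# select the same columns in the same order; raises KeyError if any are missing
X_largedata_aligned = X_largedata[fit_columns]

ens.estimators_[2].fit(X_largedata_aligned, y_largedata)
pred = ens.predict(X_test[fit_columns])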

Upvotes: 2

Views: 698

Answers (1)

warped

Reputation: 9481

So, it seems that the individual classifiers are stored in a list that can be accessed with .estimators_. The entries of this list are classifiers, each with its own .fit method. Here is an example with logistic regression:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier

X1, y1 = make_classification(random_state=1)
X2, y2 = make_classification(random_state=2)


clf1 = LogisticRegression(random_state=1)
clf2 = LogisticRegression(random_state=2)
clf3 = LogisticRegression(random_state=3)


voting = VotingClassifier(estimators=[
    ('a', clf1),
    ('b', clf2),
    ('c', clf3),
])

# fit all three estimators on the first dataset
voting = voting.fit(X1, y1)

# refit only the last estimator, in place, on different data
voting.estimators_[-1].fit(X2, y2)
voting.predict(X2)
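To check that this refits the member the ensemble actually uses (and not a copy), one sketch, assuming the setup above, is to compare the last member's coefficients before and after the in-place refit:

import numpy as np

# fresh joint fit on the first dataset
voting.fit(X1, y1)
before = voting.estimators_[-1].coef_.copy()

# refit only the last member, in place, on the second dataset
voting.estimators_[-1].fit(X2, y2)

# the object the ensemble predicts with has changed, so the coefficients differ
print(np.allclose(before, voting.estimators_[-1].coef_))  # expected: False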

Edit: the difference between .estimators and .estimators_

.estimators

This is a list of tuples of the form (name, estimator):

for e in voting.estimators:
    print(e)

('a', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=1, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False))
('b', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=2, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False))
('c', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=3, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False))

.estimators_

This is just a list of the estimators, without the names:

for e in voting.estimators_:
    print(e)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=1, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=2, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=3, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

Interestingly, though, voting.estimators[0][1] == voting.estimators_[0] evaluates to False, so the entries do not seem to be the same objects.
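This is expected: fit() clones each estimator before training it, so .estimators keeps the unfitted originals while .estimators_ holds the trained clones. A minimal sketch to see the difference, reusing the voting object from above (the coef_ attribute only exists after fitting):

original = voting.estimators[0][1]   # the unfitted estimator passed in
fitted = voting.estimators_[0]       # the clone that fit() actually trained

print(hasattr(original, "coef_"))    # False: the original was never fitted
print(hasattr(fitted, "coef_"))      # True: fit() trained a clone
print(original is fitted)            # False: different objects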

The predict method of the voting classifier uses the .estimators_ list; see lines 295 - 323 of the source.

Upvotes: 1
