ramobal

Reputation: 261

sklearn random forests overwriting each other

I'm using sklearn for random forest classification and want to compare two descriptor sets (one with 125 features, one with 154 features). I therefore create two separate random forests, but they seem to overwrite each other, which leads to the error: 'Number of features of the model must match the input. Model n_features is 125 and input n_features is 154'

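import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import plot_confusion_matrix

rf_std = RandomForestClassifier(n_estimators=150, max_depth=200, max_features='sqrt')
rf_nostd = RandomForestClassifier(n_estimators=150, max_depth=200, max_features='sqrt')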

rf_std=rf_std.fit(X_train_std,y_train_std)
print('Testing score std:',rf_std.score(X_test_std,y_test_std))

rf_nostd=rf_nostd.fit(X_train_nostd,y_train_nostd)
print('Testing score nostd:',rf_nostd.score(X_test_nostd,y_test_nostd))
# everything works up to here

fig, (ax1, ax2) = plt.subplots(1, 2)

disp = plot_confusion_matrix(rf_std, X_test_std, y_test_std,
                                 cmap=plt.cm.Blues,
                                 normalize='true',ax=ax1)
disp = plot_confusion_matrix(rf_nostd, X_test_nostd, y_test_nostd,
                                 cmap=plt.cm.Blues,
                                 normalize='true',ax=ax2)
plt.show()
# the error is raised here

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-27-eee9fea5dbfb> in <module>
      3 disp = plot_confusion_matrix(rf_std, X_test_std, y_test_std,
      4                                  cmap=plt.cm.Blues,
----> 5                                  normalize='true',ax=ax1)
      6 disp = plot_confusion_matrix(rf_nostd, X_test_nostd, y_test_nostd,
      7                                  cmap=plt.cm.Blues,

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\_plot\confusion_matrix.py in plot_confusion_matrix(estimator, X, y_true, labels, sample_weight, normalize, display_labels, include_values, xticks_rotation, values_format, cmap, ax)
    183         raise ValueError("plot_confusion_matrix only supports classifiers")
    184 
--> 185     y_pred = estimator.predict(X)
    186     cm = confusion_matrix(y_true, y_pred, sample_weight=sample_weight,
    187                           labels=labels, normalize=normalize)

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\ensemble\_forest.py in predict(self, X)
    610             The predicted classes.
    611         """
--> 612         proba = self.predict_proba(X)
    613 
    614         if self.n_outputs_ == 1:

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\ensemble\_forest.py in predict_proba(self, X)
    654         check_is_fitted(self)
    655         # Check data
--> 656         X = self._validate_X_predict(X)
    657 
    658         # Assign chunk of trees to jobs

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\ensemble\_forest.py in _validate_X_predict(self, X)
    410         check_is_fitted(self)
    411 
--> 412         return self.estimators_[0]._validate_X_predict(X, check_input=True)
    413 
    414     @property

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\tree\_classes.py in _validate_X_predict(self, X, check_input)
    389                              "match the input. Model n_features is %s and "
    390                              "input n_features is %s "
--> 391                              % (self.n_features_, n_features))
    392 
    393         return X

ValueError: Number of features of the model must match the input. Model n_features is 125 and input n_features is 154 

EDIT: Fitting the second random forest somehow overwrites the first one:

rf_std=rf_std.fit(X_train_std,y_train_std)
print(rf_std.n_features_)
rf_nostd=rf_nostd.fit(X_train_nostd,y_train_nostd)
print(rf_std.n_features_)
Output:
154
125

Why aren't the two models separate? Can anyone help?

Upvotes: 2

Views: 371

Answers (2)

MJ029

Reputation: 199

This generally occurs when the number of features in your train and test sets doesn't match. Could you check that both of the comparisons below evaluate to True?

X_train_std.shape[1] == X_test_std.shape[1]  
X_train_nostd.shape[1] == X_test_nostd.shape[1]

If they match, the inputs are fine; otherwise, look for the place in your code where the feature counts diverge.
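
The same check can be wrapped in a small helper that reports which pair is inconsistent. This is only a minimal sketch: `feature_mismatches` is a hypothetical name, and the random arrays merely stand in for the real descriptor sets.

```python
import numpy as np


def feature_mismatches(pairs):
    """Return (name, n_train_features, n_test_features) for each mismatched pair.

    `pairs` maps a label to a (train, test) tuple of 2-D arrays.
    """
    mismatches = []
    for name, (train, test) in pairs.items():
        # Axis 1 holds the features (columns) of a 2-D design matrix.
        if train.shape[1] != test.shape[1]:
            mismatches.append((name, train.shape[1], test.shape[1]))
    return mismatches


# Stand-in data: the "std" pair is deliberately inconsistent.
rng = np.random.default_rng(0)
pairs = {
    "std": (rng.random((40, 154)), rng.random((10, 125))),
    "nostd": (rng.random((40, 125)), rng.random((10, 125))),
}
print(feature_mismatches(pairs))  # [('std', 154, 125)]
```

An empty list means every train/test pair has the same number of columns.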

Regards,
MJ

Upvotes: 0

Nicolas Gervais

Reputation: 36594

I was able to reproduce this error with inconsistent train and test input shapes.

Try this:

assert X_train_std.shape[-1] == X_test_std.shape[-1], "Input shapes don't match."
assert X_train_nostd.shape[-1] == X_test_nostd.shape[-1], "Input shapes don't match."

This is how I reproduced your error:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_train_std = np.random.rand(400, 154)
X_test_std = np.random.rand(100, 125)

y_train_std = np.random.randint(0, 2, 400).tolist()
y_test_std = np.random.randint(0, 2, 100).tolist()

rf_std = RandomForestClassifier(n_estimators = 150, 
    max_depth = 200, max_features = 'sqrt')

rf_std=rf_std.fit(X_train_std,y_train_std)
print('Testing score std:',rf_std.score(X_test_std,y_test_std))

ValueError: Number of features of the model must match the input. Model n_features is 154 and input n_features is 125
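
For comparison, here is a sketch of the intended setup with consistent shapes. Two separately constructed forests keep their own fitted state, so fitting one should not change the other. The data is random stand-in data, `n_estimators` is reduced only to keep the run quick, and `n_features_in_` assumes scikit-learn 0.24 or later (older versions expose `n_features_` instead).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Stand-in descriptor sets: 154 features for "std", 125 for "nostd".
X_train_std, X_test_std = rng.random((80, 154)), rng.random((20, 154))
X_train_nostd, X_test_nostd = rng.random((80, 125)), rng.random((20, 125))
y_train = rng.integers(0, 2, 80)

# Two distinct estimator objects, one per descriptor set.
rf_std = RandomForestClassifier(n_estimators=10, max_features="sqrt", random_state=0)
rf_nostd = RandomForestClassifier(n_estimators=10, max_features="sqrt", random_state=0)

rf_std.fit(X_train_std, y_train)
rf_nostd.fit(X_train_nostd, y_train)

# Each estimator remembers the feature count it was fitted on.
print(rf_std.n_features_in_, rf_nostd.n_features_in_)

# Predicting with the matching test set works for both models.
pred_std = rf_std.predict(X_test_std)
pred_nostd = rf_nostd.predict(X_test_nostd)
```

If the feature count of `rf_std` appears to change after fitting `rf_nostd`, a likely cause is that the two names refer to the same object, or that the training arrays were swapped or reassigned in an earlier notebook cell.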

Upvotes: 1
