kikee1222

Reputation: 1996

RandomizedSearchCV not applying the selected parameters

I hope you can help

I've been trying to tune my random forest model using RandomizedSearchCV from scikit-learn.

As shown below, I have given it a choice of several max depths and several values for the minimum samples per leaf.

# Create a base model
model = RandomForestClassifier()

# Instantiate the random search model
best = RandomizedSearchCV(model, {
'bootstrap': [True, False],
'max_depth': [80, 90, 100, 110],
'min_samples_leaf': [3, 4, 5]
}, cv=5, return_train_score=True, iid=True, n_iter=4)

best.fit(train_features, train_labels.ravel())
print(best.best_score_)
print(best)

But when I run this, I get the output below, where max_depth and min_samples_leaf are set to values that are not in my lists.

What am I doing wrong here?

RandomizedSearchCV(cv=5, error_score='raise',
          estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            **max_depth=None**, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            **min_samples_leaf=1**, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
          fit_params=None, iid=True, n_iter=4, n_jobs=1,
          param_distributions={'bootstrap': [True, False], 'max_depth': [80, 90, 100, 110], 'min_samples_leaf': [3, 4, 5]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring=None, verbose=0)

Upvotes: 1

Views: 4411

Answers (1)

desertnaut

Reputation: 60321

Your chosen name for your RandomizedSearchCV object, best, is actually a misnomer: best contains all the parameters, not only the best ones, including the parameters of your RF model, some of which will actually be overridden during the randomized search. So print(best), as expected, gives exactly this result, i.e. all the parameter values, including the defaults of the RF that will not actually be used here (they will be overridden by the values drawn from your parameter grid).

What you should ask instead is

print(best.best_params_)

for the best found parameters, and

print(best.best_estimator_)

for the whole RF model with the best parameters found.
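Putting these together on your fitted search object, here is a minimal sketch, assuming best.fit(train_features, train_labels.ravel()) has already completed as in your snippet:

print(best.best_score_)      # mean cross-validated score of the best parameter combination
print(best.best_params_)     # only the parameters sampled from your grid
print(best.best_estimator_)  # the full RF model refitted with those parameters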

Here is a reproducible example using the iris data (and the name clf instead of best):

from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from sklearn.model_selection import RandomizedSearchCV

iris = datasets.load_iris()

parameters = {
'bootstrap': [True, False],
'max_depth': [80, 90, 100, 110],
'min_samples_leaf': [3, 4, 5]
}

# base RF model with default hyperparameters
model = RandomForestClassifier()

# randomized search over the parameter grid defined above
clf = RandomizedSearchCV(model, parameters, cv=5, return_train_score=True, iid=True, n_iter=4)
clf.fit(iris.data, iris.target)

Notice that the default console output of this last fit command, even without any print request, will be:

RandomizedSearchCV(cv=5, error_score='raise-deprecating',
          estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
          fit_params=None, iid=True, n_iter=4, n_jobs=None,
          param_distributions={'max_depth': [80, 90, 100, 110], 'bootstrap': [True, False], 'min_samples_leaf': [3, 4, 5]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring=None, verbose=0)

which is essentially the same as the one you report (and as explained above): just the default values of your RF model (since you did not specify any parameters for model), plus the parameter grid. To get the specific parameter set that was selected, you should use

clf.best_params_
# {'bootstrap': True, 'max_depth': 90, 'min_samples_leaf': 5}

and asking for clf.best_estimator_ confirms that we indeed get an RF with these exact parameter values:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=90, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=5, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
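
Also note that, since refit=True by default, the fitted search object exposes the refitted best model directly, so you can predict with clf itself; a minimal sketch on the iris data used above:

preds = clf.predict(iris.data)             # delegates to clf.best_estimator_ because refit=True
print(clf.score(iris.data, iris.target))   # accuracy of the refitted best model on the training data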

Upvotes: 5
