Poppy

Reputation: 51

RandomForest, how to choose the optimal n_estimator parameter

I want to train my model and choose the optimal number of trees. My code is here:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

tree_dep = [3,5,6]
tree_n = [2,5,7]

avg_rf_f1 = []
search = []

for x in tree_dep:
  for y in tree_n:
    search.append((x,y))
    rf_model = RandomForestClassifier(n_estimators=tree_n, max_depth=tree_dep, random_state=42)
    rf_scores = cross_val_score(rf_model, X_train, y_train, cv=10, scoring='f1_macro')

    avg_rf_f1.append(np.mean(rf_scores))

best_tree_dep, best_n = search[np.argmax(avg_rf_f1)]

The error is in this line:

rf_scores = cross_val_score(rf_model, X_train, y_train, cv=10, scoring='f1_macro')

saying

ValueError: n_estimators must be an integer, got <class 'list'>.

I am wondering how to fix it. Thank you!

Upvotes: 1

Views: 11010

Answers (2)

Luc Blassel

Reputation: 404

There is a helper class in scikit-learn called GridSearchCV that does just that. It takes a grid of parameter values you want to test, trains a classifier on every combination of them, and returns the best set of parameters.
It is a lot cleaner and faster than the nested-loop approach you are implementing. It is easily extendable to other parameters (just add the desired parameters to your grid) and it can be parallelized.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

params_to_test = {
    'n_estimators':[2,5,7],
    'max_depth':[3,5,6]
}

# here you can set any parameters you want fixed across all runs, like random_state or verbosity
rf_model = RandomForestClassifier(random_state=42)
# here you specify the CV parameters: number of folds, number of cores to use...
grid_search = GridSearchCV(rf_model, param_grid=params_to_test, cv=10, scoring='f1_macro', n_jobs=4)

grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_ 

#best_params is a dict you can pass directly to train a model with optimal settings 
best_model = RandomForestClassifier(**best_params)

As pointed out in the comments, the best model is stored in the grid_search object, so instead of creating a new model with:

best_model = RandomForestClassifier(**best_params)

we can just use the one stored in grid_search:

best_model = grid_search.best_estimator_
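
Note that GridSearchCV refits the best estimator on the full training set by default (refit=True), so best_estimator_ is already trained and can predict directly. A minimal sketch of how you might use it, assuming a held-out X_test from the same split that produced X_train:

# the refitted best model needs no extra fit() call
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# mean cross-validated f1_macro of the best parameter combination
print(grid_search.best_score_)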

Upvotes: 4

Piotrek

Reputation: 1530

You iterate through the elements of the lists in your loops, but you don't use them inside the loop body. Instead of passing a single element as n_estimators or max_depth, you pass the whole list. The version below should fix it: in every iteration a different combination of elements from the two lists is used:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

tree_dep = [3,5,6]
tree_n = [2,5,7]

avg_rf_f1 = []
search = []

for x in tree_dep:
  for y in tree_n:
    # record the (max_depth, n_estimators) pair being evaluated
    search.append((x,y))
    rf_model = RandomForestClassifier(n_estimators=y, max_depth=x, random_state=42)
    rf_scores = cross_val_score(rf_model, X_train, y_train, cv=10, scoring='f1_macro')

    avg_rf_f1.append(np.mean(rf_scores))

# pick the pair with the highest mean f1_macro score
best_tree_dep, best_n = search[np.argmax(avg_rf_f1)]
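
To actually use the winning combination you still need to fit a final model once. A minimal follow-up sketch, reusing the X_train and y_train from above:

# retrain on the full training data with the best pair found
best_model = RandomForestClassifier(n_estimators=best_n, max_depth=best_tree_dep, random_state=42)
best_model.fit(X_train, y_train)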

Upvotes: 2
