Reputation: 51
I want to train my model and choose the optimal number of trees. My code is here:
from sklearn.ensemble import RandomForestClassifier
tree_dep = [3,5,6]
tree_n = [2,5,7]
avg_rf_f1 = []
search = []
for x in tree_dep:
    for y in tree_n:
        search.append((a,b))
        rf_model = RandomForestClassifier(n_estimators=tree_n, max_depth=tree_dep, random_state=42)
        rf_scores = cross_val_score(rf_model, X_train, y_train, cv=10, scoring='f1_macro')
        avg_rf_f1.append(np.mean(rf_scores))
best_tree_dep, best_n = search[np.argmax(avg_rf_f1)]
The error is in this line:
rf_scores = cross_val_score(rf_model, X_train, y_train, cv=10, scoring='f1_macro')
saying:
ValueError: n_estimators must be an integer, got <class 'list'>.
I'm wondering how to fix it. Thank you!
Upvotes: 1
Views: 11010
Reputation: 404
There is a helper class in scikit-learn called GridSearchCV that does just that. It takes the lists of parameter values you want to test and trains a classifier with every possible combination of them to return the best set of parameters.
It is a lot cleaner and faster than the nested-loop method you are implementing, it is easily extendable to other parameters (just add the desired parameters to your grid), and it can be parallelized.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
params_to_test = {
    'n_estimators': [2, 5, 7],
    'max_depth': [3, 5, 6]
}
# Here you can put any parameter you want at every run, like random_state or verbosity
rf_model = RandomForestClassifier(random_state=42)
# Here you specify the CV parameters: number of folds, number of cores to use...
grid_search = GridSearchCV(rf_model, param_grid=params_to_test, cv=10, scoring='f1_macro', n_jobs=4)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
#best_params is a dict you can pass directly to train a model with optimal settings
best_model = RandomForestClassifier(**best_params)
As pointed out in the comments, the best model is stored in the grid_search object, so instead of creating a new model with:
best_model = RandomForestClassifier(**best_params)
we can just use the one already fitted in grid_search:
best_model = grid_search.best_estimator_
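A minimal, self-contained sketch of this (using synthetic data from make_classification as a stand-in for your X_train/y_train, and a smaller cv for speed — both assumptions for the demo). It shows that best_estimator_ is already refitted on the full training data, so it can predict immediately without another fit() call:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train (assumption for the demo)
X_train, y_train = make_classification(n_samples=200, n_features=8, random_state=42)

params_to_test = {
    'n_estimators': [2, 5, 7],
    'max_depth': [3, 5, 6]
}
grid_search = GridSearchCV(RandomForestClassifier(random_state=42),
                           param_grid=params_to_test, cv=3, scoring='f1_macro')
grid_search.fit(X_train, y_train)

# best_estimator_ is already refitted on all of X_train (refit=True is the
# default), so no extra fit() call is needed before predicting
best_model = grid_search.best_estimator_
preds = best_model.predict(X_train)
```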
Upvotes: 4
Reputation: 1530
You iterate through the elements of the lists in your loops, but you don't use them inside the loop: instead of providing a single element from the list as n_estimators or max_depth, you provide the whole list. This should fix it; now every iteration takes a different combination of elements from the two lists:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

tree_dep = [3,5,6]
tree_n = [2,5,7]
avg_rf_f1 = []
search = []
for x in tree_dep:
    for y in tree_n:
        search.append((x, y))
        rf_model = RandomForestClassifier(n_estimators=y, max_depth=x, random_state=42)
        rf_scores = cross_val_score(rf_model, X_train, y_train, cv=10, scoring='f1_macro')
        avg_rf_f1.append(np.mean(rf_scores))
best_tree_dep, best_n = search[np.argmax(avg_rf_f1)]
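If you keep the manual loop, itertools.product flattens the two nested loops into one and keeps each parameter pair aligned with its score by construction. A runnable sketch (make_classification and a smaller cv are stand-ins for your X_train/y_train and cv=10 — assumptions for the demo):

```python
from itertools import product

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train / y_train (assumption for the demo)
X_train, y_train = make_classification(n_samples=150, n_features=6, random_state=42)

tree_dep = [3, 5, 6]
tree_n = [2, 5, 7]

# One loop over all (max_depth, n_estimators) combinations
search = list(product(tree_dep, tree_n))
avg_rf_f1 = []
for depth, n_trees in search:
    rf_model = RandomForestClassifier(n_estimators=n_trees, max_depth=depth, random_state=42)
    scores = cross_val_score(rf_model, X_train, y_train, cv=3, scoring='f1_macro')
    avg_rf_f1.append(np.mean(scores))

# search[i] and avg_rf_f1[i] refer to the same combination by construction
best_tree_dep, best_n = search[np.argmax(avg_rf_f1)]
```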
Upvotes: 2