Reputation: 113
I'm building a Random Forest binary classifier in Python on a pre-processed dataset with 4898 instances, a 60-40 stratified train-test split, and 78% of the data belonging to one target label and the rest to the other. What value of n_estimators should I choose in order to achieve the most practically useful / best possible random forest classifier model? I plotted the accuracy vs. n_estimators curve using the code snippet below. x_train and y_train are the features and target labels of the training set respectively, and x_test and y_test are the features and target labels of the test set respectively.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

scores = []
for k in range(1, 200):
    rfc = RandomForestClassifier(n_estimators=k)
    rfc.fit(x_train, y_train)
    y_pred = rfc.predict(x_test)
    scores.append(accuracy_score(y_test, y_pred))
import matplotlib.pyplot as plt
%matplotlib inline
# plot the relationship between K and testing accuracy
# plt.plot(x_axis, y_axis)
plt.plot(range(1, 200), scores)
plt.xlabel('Value of n_estimators for Random Forest Classifier')
plt.ylabel('Testing Accuracy')
Here, it is visible that a high value of n_estimators gives a good accuracy score, but the curve fluctuates randomly even between nearby values of n_estimators, so I can't pick the best one precisely. I only want to know about tuning the n_estimators
hyperparameter: how should I choose it? Should I use a ROC or CAP curve instead of accuracy_score
? Thanks.
Upvotes: 5
Views: 33087
Reputation: 123
Is it only me, or does no one who already answered this question actually answer it? In case you are still looking for how to get the accuracy score and the n_estimators value you want, I may be able to answer it.
First, you already answered it in your own code, in these lines.
scores = []
for k in range(1, 200):
    rfc = RandomForestClassifier(n_estimators=k)
    rfc.fit(x_train, y_train)
    y_pred = rfc.predict(x_test)
    scores.append(accuracy_score(y_test, y_pred))
As you can see, you already saved each accuracy_score into scores
. So you just need to recall it by finding the maximum value in the scores list.
maxs = max(scores)
maxs_idx = scores.index(maxs)
Then just put a print command in the final line. Note that the loop started at n_estimators=1, so the list index is offset by one:
print(f"Accuracy Score: {maxs} with n_estimators: {maxs_idx + 1}")
I hope your problem has been solved. I also thank you, because your code helped me find a way to pick the best number of estimators too.
Upvotes: -2
Reputation: 6080
Don't use grid search
for this case - it is overkill - and since you set the parameter values arbitrarily, you may not end up with the optimum number.
There is a staged_predict
method on scikit-learn's gradient boosting estimators with which you can measure the validation error at each stage of training to find the optimum number of trees.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y)

# train once with a big number for n_estimators
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=100)
gbrt.fit(X_train, y_train)

# measure the validation error after each added tree
errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
bst_n_estimators = np.argmin(errors) + 1

# retrain with the optimal number of trees
gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)
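As a side note, newer scikit-learn versions can stop adding trees automatically: a minimal sketch using the n_iter_no_change early-stopping parameter of GradientBoostingRegressor, which monitors an internal validation fraction (X_train, y_train as above):
from sklearn.ensemble import GradientBoostingRegressor

# stop growing trees once the internal validation score has not improved
# for 10 consecutive iterations
gbrt_es = GradientBoostingRegressor(max_depth=2, n_estimators=500,
                                    validation_fraction=0.1,
                                    n_iter_no_change=10,
                                    random_state=0)
gbrt_es.fit(X_train, y_train)
print("Trees actually used:", gbrt_es.n_estimators_)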
Upvotes: 0
Reputation: 76
It is natural that a random forest stabilizes after some number of n_estimators (because, unlike boosting, there is no mechanism to "slow down" the fitting). Since there is no benefit to adding more weak tree estimators beyond that point, you can choose a value around 50.
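To see where the forest stabilizes without re-fitting from scratch at every size, a minimal sketch using warm_start and the out-of-bag error (fixing random_state also removes the run-to-run fluctuation mentioned in the question; x_train, y_train are the question's training data):
from sklearn.ensemble import RandomForestClassifier

# warm_start=True keeps the already-grown trees and only adds new ones,
# so each step grows 10 extra trees instead of rebuilding the forest
rfc = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)
for n in range(10, 201, 10):
    rfc.set_params(n_estimators=n)
    rfc.fit(x_train, y_train)
    print(f"n_estimators={n:3d}  OOB error={1 - rfc.oob_score_:.4f}")
The n_estimators at which the OOB error flattens out is a reasonable choice.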
Upvotes: 0
Reputation: 4253
See this RandomizedSearchCV example: https://github.com/dnishimoto/python-deep-learning/blob/master/Random%20Forest%20Tennis.ipynb
I used RandomizedSearchCV to find the best params for the Random Forest classifier.
n_estimators is the number of decision trees to use.
Try using XGBoost to get more accuracy.
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score

parameter_grid = {'n_estimators': [1, 2, 3, 4, 5],
                  'max_depth': [2, 4, 6, 8, 10],
                  'min_samples_leaf': [1, 2, 4],
                  'max_features': [1, 2, 3, 4, 5, 6, 7, 8]}
number_models = 4

# pipeline['clf'] is the RandomForestClassifier step of a pipeline
# defined earlier in the notebook
random_RandomForest_class = RandomizedSearchCV(
    estimator=pipeline['clf'],
    param_distributions=parameter_grid,
    n_iter=number_models,
    scoring='accuracy',
    n_jobs=2,
    cv=4,
    refit=True,
    return_train_score=True)

random_RandomForest_class.fit(X_train, y_train)
predictions = random_RandomForest_class.predict(X)

print("Accuracy Score", accuracy_score(y, predictions))
print("Best params", random_RandomForest_class.best_params_)
print("Best score", random_RandomForest_class.best_score_)
Upvotes: 0