Reputation: 4575
I'm trying to train a model and make predictions using the code below on a dataset with around 300 records and 100 features. I'm wondering if the choices of n_estimators that I'm searching in the code are too high. Since I've only got 300 records, would it make more sense to try something like [10, 20, 30] for n_estimators? Is n_estimators related to the size of the training dataset? How about the learning rate?
Code:
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
# TODO: Initialize the classifier
clf = AdaBoostClassifier(random_state=0)
# TODO: Create the parameters list you wish to tune
parameters = {'n_estimators':[100,200,300],'learning_rate':[1.0,2.0,4.0]}
# TODO: Make an fbeta_score scoring object
scorer = make_scorer(accuracy_score)
# TODO: Perform grid search on the classifier using 'scorer' as the scoring method
grid_obj = GridSearchCV(clf,parameters,scoring=scorer)
# TODO: Fit the grid search object to the training data and find the optimal parameters
grid_fit = grid_obj.fit(X_train,y_train)
# Get the estimator
best_clf = grid_fit.best_estimator_
# Make predictions using the unoptimized model
predictions = (clf.fit(X_train, y_train)).predict(X_test)
best_predictions = best_clf.predict(X_test)
Upvotes: 0
Views: 2722
Reputation: 1580
Let's take it one at a time:
n_estimators: By definition, the higher n_estimators is, the more trees are built and used for voting. So yes, you are on the right track by searching over larger estimator counts.
learning_rate: The learning rate determines the impact of each tree on the output, and the parameter controls the magnitude of that impact. On top of that, you should start with a very low learning_rate, maybe 0.001 or 0.01; this will make your model more robust, and you will be better able to control the variance on your dev/test set.
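As a minimal sketch of the idea above, you could widen the grid toward lower learning rates and keep a moderate range of estimator counts. The synthetic dataset here only mimics the question's shape (300 records, 100 features); the specific values in the grid are illustrative, not tuned:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the question's data: 300 records, 100 features
X, y = make_classification(n_samples=300, n_features=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pair lower learning rates with a range of estimator counts;
# a small grid keeps the search cheap on a 300-record dataset
parameters = {'n_estimators': [50, 100, 200],
              'learning_rate': [0.01, 0.1, 1.0]}

grid = GridSearchCV(AdaBoostClassifier(random_state=0),
                    parameters, scoring='accuracy', cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.best_estimator_.score(X_test, y_test))
```

With cross-validation in the search, the grid itself tells you whether your data prefers few strong updates (high learning rate, few trees) or many small ones (low learning rate, more trees), rather than you guessing from the dataset size.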
Hope this helps :)
Upvotes: 2