Reputation: 1578
I want to optimize the parameters of a RandomForest regression model, in order to find the best trade-off between accuracy and prediction speed. My idea was to use a randomized grid search, and to evaluate the speed and accuracy of each of the tested random parameter configurations.
So, I prepared a parameter grid, and I can run k-fold CV on the training data:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

## parameter grid for random search
n_estimators = [1, 40, 80, 100, 120]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                               n_iter=100, cv=3, verbose=2, n_jobs=-1)
rf_random.fit(X_train, y_train)
I found how to get the parameters of the best model, using:
rf_random.best_params_
However, I want to iterate through all the random models, check their parameter values, evaluate each one on the test set, and write the parameter values, accuracy and speed to an output dataframe, so something like:
for model in rf_random:  # how do I iterate over the tested models?
    start_time_base = time.time()
    y_pred = model.predict(X_test)  # evaluate the current random model on the test data
    pred_time = (time.time() - start_time_base) / X_test.shape[0]
    rmse = mean_squared_error(y_test, y_pred, squared=False)
    params = ...  # something to get the values of the parameters for this model
    # write to dataframe...
Is there a way to do that? Just to be clear, I'm asking about iterating over the models and their parameters, not about the writing to the dataframe part :) Or should I take a different approach altogether?
Upvotes: 1
Views: 1524
Reputation: 658
You get the df you're looking to create, with model parameters and CV results, by calling rf_random.cv_results_, which you can put straight into a df: all_results = pd.DataFrame(rf_random.cv_results_).
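Note that cv_results_ already records timing for every candidate, so you may not even need to time the models yourself. A minimal sketch of pulling accuracy and speed proxies straight out of it (assuming rf_random has been fit as in the question; summary is just a hypothetical name):

import pandas as pd

all_results = pd.DataFrame(rf_random.cv_results_)
# 'params' holds each candidate's parameter dict; 'mean_score_time' is the
# average time spent scoring the validation folds, a rough proxy for
# prediction speed.
summary = all_results[['params', 'mean_test_score',
                       'mean_fit_time', 'mean_score_time']]
print(summary.sort_values('mean_test_score', ascending=False).head())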
In practice this is generally considered a good measure of all the metrics you're looking for, so what you describe in the question is usually unnecessary. However, if you want to go through with it anyway (i.e. evaluate against a held-out test set rather than cross-validate), you can go through this df and define a model with each parameter combination in a loop:
for i in range(len(all_results)):
    # cv_results_ stores the complete parameter dict for each candidate under
    # 'params' (the per-parameter columns are prefixed, e.g. 'param_n_estimators')
    model = RandomForestRegressor(**all_results['params'][i])
    model.fit(X_train, y_train)
    start_time_base = time.time()
    y_pred = model.predict(X_test)  # evaluate the current random model on the test data
    pred_time = (time.time() - start_time_base) / X_test.shape[0]
    # Evaluate predictions however you see fit
As RandomizedSearchCV only keeps the refit model for the best parameter combination (best_estimator_), you'll need to retrain the models in this loop.
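If you do also want the dataframe part, a minimal sketch putting it all together (rows and results_df are hypothetical names; X_train, y_train, X_test and y_test as in the question):

import time
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rows = []
for params in rf_random.cv_results_['params']:
    model = RandomForestRegressor(**params)
    model.fit(X_train, y_train)
    start = time.time()
    y_pred = model.predict(X_test)
    pred_time = (time.time() - start) / X_test.shape[0]  # seconds per sample
    rmse = mean_squared_error(y_test, y_pred, squared=False)
    rows.append({**params, 'rmse': rmse, 'time_per_sample': pred_time})

results_df = pd.DataFrame(rows)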
Upvotes: 2