Reputation: 9
I'm working with K-Fold Cross-Validation in a Grid Search setup for hyperparameter tuning. I have a few questions about how the model is trained and evaluated:
When I use GridSearchCV, the model is evaluated across multiple folds (let's say 10). For each hyperparameter combination, the model is trained on K-1 folds and validated on the remaining fold. When I obtain the best_grid model after the grid search, which specific training data (i.e., which folds) was this model trained on?
When I call best_grid.predict(X_test), on which dataset is this model making predictions? Has it been trained on the entire dataset after the grid search, or is it still based on the folds used during cross-validation?
If the best_grid model has not been trained on the entire dataset yet, do I need to explicitly fit it to the full dataset again before making predictions?
I want to get the R² train score, but I'm confused about the score I receive when using the following code:
param_grid = {f'regressor__regressor__{param}': values for param, values in model_info['params'].items()}
grid_search = GridSearchCV(full_pipeline, param_grid, cv=stratified_kf.split(X, y_binned), scoring="r2", n_jobs=4, return_train_score=True)
grid_search.fit(X, y)
if grid_search.best_score_ > best_score:
    best_score = grid_search.best_score_
    best_model = model_name
    best_grid = grid_search
mean_train_score = best_grid.cv_results_['mean_train_score'][best_grid.best_index_]
print(mean_train_score)  # THIS THING HERE
Upvotes: -1
Views: 56
Reputation: 8152
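The code you have returns the mean R2 score over the training portions of the cross-validation splits, for the best parameter combination. It's there in cv_results_, as mean_train_score, alongside std_train_score and the per-split entries (split0_train_score, split1_train_score, etc.). Note that this is an average over the models fitted during cross-validation, not the score of the final refit model on all of X.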
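To make the difference between these training scores concrete, here is a minimal sketch. It assumes synthetic data from make_regression and a plain Ridge model in place of the asker's pipeline, so the names and parameter values are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data and a simple model, assumed purely for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

grid_search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0]},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="r2",
    return_train_score=True,  # needed for the *_train_score entries
)
grid_search.fit(X, y)

best = grid_search.best_index_

# Mean R2 over the 5 training splits (each 4/5 of X) for the best parameters
mean_train_r2 = grid_search.cv_results_["mean_train_score"][best]

# The same quantity rebuilt from the per-split entries
split_train_r2 = [
    grid_search.cv_results_[f"split{i}_train_score"][best] for i in range(5)
]

# R2 of the refit model evaluated on all of X (score() uses the "r2" scorer here)
full_train_r2 = grid_search.score(X, y)

print(mean_train_r2, np.mean(split_train_r2), full_train_r2)
```

The first two numbers should agree; the third comes from a different model (the one refit on all of X), so it will usually be close but not identical.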
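Note that the grid_search object keeps track of the best score and best parameters for you (as best_score_ and best_params_), so within a single search you don't need your if block. It only matters if you are comparing several separate searches, for example one per model type, as your model_name variable suggests.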
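As a small sketch of what a fitted search already records, assuming synthetic data and a Ridge model as stand-ins for the real pipeline:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data, assumed purely for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

grid_search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=5, scoring="r2")
grid_search.fit(X, y)

# Winning hyperparameters and their mean cross-validated (validation) score
print(grid_search.best_params_)
print(grid_search.best_score_)

# best_score_ is the mean_test_score entry at best_index_
best_row_score = grid_search.cv_results_["mean_test_score"][grid_search.best_index_]
```

Here "test" in mean_test_score refers to the held-out fold of each split, not to a separate X_test.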
The grid_search object is itself an estimator. After the search, it is refit with the best parameters (as chosen by the cross-validated grid search) on all the data you gave it (X and y in your case), because refit=True by default. So when you call predict, you are using the best model determined by the grid search, trained on all of X rather than on any particular subset of folds. (This model is also available as grid_search.best_estimator_.)
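One way to check the refit behaviour is to fit a fresh model with the winning parameters on all of X and compare it with best_estimator_. The sketch below assumes synthetic data, a Ridge model, and a hypothetical train/test split made with train_test_split.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data and split, assumed purely for illustration
X_all, y_all = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X, X_test, y, y_test = train_test_split(X_all, y_all, test_size=0.25, random_state=0)

grid_search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=5, scoring="r2")
grid_search.fit(X, y)  # refit=True is the default

# Fit a fresh Ridge with the winning parameters on all of X by hand
manual = Ridge(**grid_search.best_params_).fit(X, y)

# The refit model matches the manual fit on the full training data
same_coef = np.allclose(manual.coef_, grid_search.best_estimator_.coef_)

# grid_search.predict delegates to best_estimator_.predict
same_pred = np.allclose(
    grid_search.predict(X_test), grid_search.best_estimator_.predict(X_test)
)
print(same_coef, same_pred)
```

With refit=False there is no best_estimator_, and predict is not available on the search object, so an explicit fit would then be needed.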
If you call grid_search.predict(X_test), then you are predicting on X_test, so this data should not have been part of X.
Remember to refit your best model on absolutely all your data before using it in production! (That is, combine X and X_test, together with their targets, to fit the best model.)
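A possible way to do that final refit is sketched below. It assumes synthetic data, a Ridge model, and array inputs that can be stacked with NumPy; final_model, X_full and y_full are hypothetical names introduced for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data and split, assumed purely for illustration
X_all, y_all = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X, X_test, y, y_test = train_test_split(X_all, y_all, test_size=0.25, random_state=0)

grid_search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=5, scoring="r2")
grid_search.fit(X, y)

# Report the held-out score first; after the final refit no unseen data remains
test_r2 = grid_search.score(X_test, y_test)

# clone() gives an unfitted copy with the same (best) hyperparameters
final_model = clone(grid_search.best_estimator_)

# Combine training and test data, then fit the production model on everything
X_full = np.vstack([X, X_test])
y_full = np.concatenate([y, y_test])
final_model.fit(X_full, y_full)
```

Once X_test has been folded into the fit, it can no longer serve as an unbiased check, so test_r2 from before the refit is the performance estimate to keep.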
Upvotes: 0