oriKAN

Reputation: 9

Understanding K-Fold Cross-Validation, Model Training, and R² Scores

I'm working with K-Fold Cross-Validation in a Grid Search setup for hyperparameter tuning. I have a few questions about how the model is trained and evaluated:

  1. When I use GridSearchCV, the model is evaluated across multiple folds (let's say 10). For each hyperparameter combination, the model is trained on K-1 folds and validated on the remaining fold. When I obtain the best_grid model after the grid search, which specific training data (i.e., which folds) was this model trained on?

  2. When I call best_grid.predict(X_test), on which dataset is this model making predictions? Has it been trained on the entire dataset after the grid search, or is it still based on the folds used during cross-validation?

  3. If the best_grid model has not been trained on the entire dataset yet, do I need to explicitly fit it to the full dataset again before making predictions?

  4. I want to get the R² train score, but I'm confused about the score I receive when using the following code:

    param_grid = {f'regressor__regressor__{param}': values
                  for param, values in model_info['params'].items()}
    grid_search = GridSearchCV(full_pipeline, param_grid,
                               cv=stratified_kf.split(X, y_binned),
                               scoring="r2", n_jobs=4,
                               return_train_score=True)

    grid_search.fit(X, y)

    if grid_search.best_score_ > best_score:
        best_score = grid_search.best_score_
        best_model = model_name
        best_grid = grid_search

    mean_train_score = best_grid.cv_results_['mean_train_score'][best_grid.best_index_]
    print(mean_train_score)  # THIS THING HERE


Upvotes: -1

Views: 56

Answers (1)

Matt Hall

Reputation: 8152

The code you have returns the R² score on the training folds, averaged across the folds. It's there in cv_results_ as mean_train_score, populated because you passed return_train_score=True.
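If you want to see all of the scores side by side, a quick sketch (assuming pandas is available):

    import pandas as pd

    # cv_results_ is a dict of parallel arrays; a DataFrame makes it easy to scan.
    results = pd.DataFrame(grid_search.cv_results_)
    print(results[['params', 'mean_train_score', 'mean_test_score']])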

Realize that the grid_search object keeps track of the best score and best parameters for you; you don't need your if block.
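A minimal sketch of the attributes that already hold this information:

    print(grid_search.best_score_)   # best mean cross-validated R² over the grid
    print(grid_search.best_params_)  # the hyperparameter combination that won
    print(grid_search.best_index_)   # its row in cv_results_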

The grid_search object is itself an estimator, and it ends up being fit with the best parameters (as determined by the cross-validated grid search) on all the data you give it (X and y in your case). This is because, by default, refit=True. So when you call predict, you are using the best model the grid search found. (This refitted model is also available as grid_search.best_estimator_.)
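Here's a sketch of what that means in practice, reusing the X, y, and X_test from your question:

    # With refit=True (the default), fit() refits the winning parameter
    # combination on all of X and y after cross-validation finishes.
    grid_search.fit(X, y)

    # predict() delegates to that refitted estimator...
    y_pred = grid_search.predict(X_test)

    # ...which you can also grab directly:
    best = grid_search.best_estimator_
    assert (best.predict(X_test) == y_pred).all()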

If you do grid_search.predict(X_test) then you are predicting on X_test. So hopefully this data was not part of X.
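In other words, the held-out set should be split off before the grid search ever sees the data. A sketch, where X_full and y_full are hypothetical names for your complete dataset:

    from sklearn.model_selection import train_test_split

    # Keep X_test completely outside the cross-validated search.
    X, X_test, y, y_test = train_test_split(X_full, y_full,
                                            test_size=0.2, random_state=0)
    grid_search.fit(X, y)                        # CV folds come only from X
    r2_test = grid_search.score(X_test, y_test)  # honest held-out R²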

Remember to refit your best model on absolutely all your data before using it in production! (That is, combine X and X_test to fit the best model.)
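A sketch of that final step, assuming X, y, X_test, and y_test are NumPy arrays:

    import numpy as np
    from sklearn.base import clone

    # Same winning hyperparameters, fresh unfitted copy.
    final_model = clone(grid_search.best_estimator_)

    # Refit on every row you have before deploying.
    X_all = np.concatenate([X, X_test])
    y_all = np.concatenate([y, y_test])
    final_model.fit(X_all, y_all)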

Upvotes: 0
