Reputation: 159
I am trying to evaluate the performance of a regressor by means of GridSearchCV. In my implementation cv is an int, so I'm applying the K-fold validation method. Looking at cv_results_['mean_test_score']
,
the best mean score on the k-fold unseen data is around 0.7, while the train scores are much higher, like 0.999. This is very normal, and I'm ok with that.
Well, following the reasoning behind this concept, when I apply the best_estimator_
on the whole data set, I expect to see at least some part of the data predicted not perfectly, right? Instead, the numerical deviations between the predicted quantities and the real values are near zero for all datapoints. And this smells of overfitting.
I don't understand that, because if I remove a small part of the data and apply GridSearchCV to the remaining part, I find almost identical results as above, but the best regressor applied to the totally unseen data predicts with much higher errors, like 10%, 30% or 50%. Which is what I expected, at least for some points, fitting GridSearchCV on the whole set, based on the results of k-fold test sets.
Now, I understand that this forces the predictor to see all datapoints, but the best estimator is the result of k fits, each of them never saw 1/k fraction of data. Being the mean_test_score
the average between these k scores, I expect to see a bunch of predictions (depending on cv value) which show errors distributed around a mean error that justifies a 0.7 score.
Upvotes: 3
Views: 628
Reputation: 2948
The refit=True
parameter of GridSearchCV makes the estimator with the found best set of hyperparameters be refit on the full data. So if your training error is almost zero in the CV folds, you would expect it to be near zero in the best_estimator_
as well.
Upvotes: 1