Reputation: 169
I'd like to manually analyse the errors my ML model makes (whatever the model is), comparing its predictions with the labels. From my understanding, this should be done on instances of the validation set, not the training set. I trained my model with GridSearchCV and extracted best_estimator_, i.e. the model that performed best during cross-validation and was then refit on the whole training set.
Therefore, my question is: how can I get predictions on a validation set to compare with the labels (without touching the test set), if my best model has been refit on the whole training set?
One solution would be to split the training set further before running GridSearchCV, but I suspect there is a better way, for example retrieving the predictions made on the validation folds during cross-validation. Is there a way to get those predictions for the best estimator?
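For reference, this is roughly my setup (the estimator, the grid and the data are just placeholders for mine):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_train, y_train = make_classification(n_samples=300, n_features=10, random_state=0)  # stand-in for my training data

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid={"n_estimators": [100, 300]}, cv=5)
search.fit(X_train, y_train)
best_model = search.best_estimator_  # refit on the whole training set (refit=True by default)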
Thank you!
Upvotes: 0
Views: 820
Reputation: 169
I understood my conceptual error; I'll post it here since it may help other ML beginners like me!
The solution that should work is to use cross_val_predict, splitting the folds in the same way as done in GridSearchCV. In fact, cross_val_predict re-fits the model on each fold and does not reuse the previously trained model, so with the best hyperparameters and the same splits the result is the same as the predictions made on the validation folds during GridSearchCV.
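A minimal sketch of what I mean (the estimator, the grid and the data are placeholders; the important part is passing the same cv splitter to GridSearchCV and cross_val_predict):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict

X_train, y_train = make_classification(n_samples=300, n_features=10, random_state=0)  # stand-in for the training data

# Explicit splitter, so GridSearchCV and cross_val_predict use identical folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid={"n_estimators": [100, 300]}, cv=cv)
search.fit(X_train, y_train)

# cross_val_predict clones and re-fits the best estimator on each training fold,
# so every prediction comes from a model that never saw that sample during fitting
val_pred = cross_val_predict(search.best_estimator_, X_train, y_train, cv=cv)

# Out-of-fold predictions vs. labels: these are the instances to analyse manually
misclassified = X_train[val_pred != y_train]
Note that the predictions correspond to the cross-validation of GridSearchCV only if the same splitter (including its random_state) is used in both calls.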
Upvotes: 1
Reputation: 253
You can compute a validation curve with the model that you obtained from GridSearchCV; see the validation_curve documentation. You just need to pass the name of the hyperparameter you want to inspect, an array of values for it, and a scoring function. Here is an example:
train_scores, valid_scores = validation_curve(model, X_train, y_train, param_name="alpha", param_range=np.logspace(-7, 3, 3), cv=5, scoring="accuracy")
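(Here validation_curve is imported from sklearn.model_selection and np is numpy; "alpha" is just an example and must be replaced by the name of a hyperparameter that model actually exposes, e.g. the regularisation strength of Ridge or Lasso.)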
Upvotes: 1