Reputation: 169
I'd like to manually analyse the errors my ML model makes (whatever the model is), comparing its predictions with the labels. From my understanding, this should be done on instances of the validation set, not the training set. I trained my model with GridSearchCV and extracted best_estimator_, i.e. the model that performed best during cross-validation and was then refit on the whole training set.
Therefore, my question is: how can I get predictions on a validation set to compare with the labels (without touching the test set), if my best model has been refit on the whole training set?
One solution would be to split the training set further before running GridSearchCV, but I suspect there is a better way, for example retrieving the predictions made on the validation folds during cross-validation. Is there a way to get those predictions for the best estimator?
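For reference, this is roughly my setup (the estimator, the grid and the data are just placeholders for mine):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_train, y_train = make_classification(n_samples=300, n_features=10, random_state=0)  # stand-in for my training data

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid={"n_estimators": [100, 300]}, cv=5)
search.fit(X_train, y_train)
best_model = search.best_estimator_  # refit on the whole training set (refit=True by default)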
Thank you!
Upvotes: 0
Views: 820
Reputation: 169
I understood my conceptual error; I'll post it here since it may help other ML beginners like me!
The solution that should work is to use cross_val_predict, splitting the folds in the same way as done in GridSearchCV. In fact, cross_val_predict re-fits the model on each fold and does not reuse the previously trained model, so with the best hyperparameters and the same splits the result is the same as the predictions made on the validation folds during GridSearchCV.
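A minimal sketch of what I mean (the estimator, the grid and the data are placeholders; the important part is passing the same cv splitter to GridSearchCV and cross_val_predict):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict

X_train, y_train = make_classification(n_samples=300, n_features=10, random_state=0)  # stand-in for the training data

# Explicit splitter, so GridSearchCV and cross_val_predict use identical folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid={"n_estimators": [100, 300]}, cv=cv)
search.fit(X_train, y_train)

# cross_val_predict clones and re-fits the best estimator on each training fold,
# so every prediction comes from a model that never saw that sample during fitting
val_pred = cross_val_predict(search.best_estimator_, X_train, y_train, cv=cv)

# Out-of-fold predictions vs. labels: these are the instances to analyse manually
misclassified = X_train[val_pred != y_train]
Note that the predictions correspond to the cross-validation of GridSearchCV only if the same splitter (including its random_state) is used in both calls.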
Upvotes: 1
Reputation: 253
You can compute a validation curve with the model that you obtained from GridSearchCV; see the validation_curve documentation. You just need to pass the name of the hyperparameter you want to inspect, an array of values for it, and a scoring function. Here is an example:
train_scores, valid_scores = validation_curve(model, X_train, y_train, param_name="alpha", param_range=np.logspace(-7, 3, 3), cv=5, scoring="accuracy")
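(Here validation_curve is imported from sklearn.model_selection and np is numpy; "alpha" is just an example and must be replaced by the name of a hyperparameter that model actually exposes, e.g. the regularisation strength of Ridge or Lasso.)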
Upvotes: 1