Reputation: 421
I have been using http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.cross_val_score.html in order to cross-validate a Logistic Regression classifier. The results I got are:
[ 0.78571429 0.64285714 0.85714286 0.71428571
0.78571429 0.64285714 0.84615385 0.53846154
0.76923077 0.66666667]
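For reference, the call that produced these scores looks roughly like this (X and y stand for my feature matrix and labels, and cv=10 matches the ten scores above):

from sklearn.cross_validation import cross_val_score
from sklearn.linear_model import LogisticRegression

scores = cross_val_score(LogisticRegression(), X, y, cv=10)
print(scores)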
My primary question is: how can I find which set/fold maximises my classifier's score, i.e. the one that produces 0.857?
Follow-up question: Is training my classifier with this set a good practice?
Thank you in advance.
Upvotes: 2
Views: 1175
Reputation: 76391
whether and how I could find which set/fold maximises my classifier's score
From the documentation of cross_val_score, you can see that it operates on a specific cv object. (If you do not give one explicitly, it will be KFold in some cases and something else in others - refer to the documentation there.)
You can iterate over this object (or an identical one) to find the exact train/test indices. For example:
from sklearn.cross_validation import KFold

for tr, te in KFold(10000, 3):
    pass  # tr, te in each iteration are the train/test indices that produced the scores you saw.
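If you want to pin down which split produced the 0.857, a rough sketch along these lines works, assuming X and y are your feature matrix and labels and that the original call used a plain 10-fold KFold (if cross_val_score defaulted to StratifiedKFold for your classifier, substitute that object so the splits match):

from sklearn.cross_validation import KFold
from sklearn.linear_model import LogisticRegression

best_score, best_split = -1.0, None
for tr, te in KFold(len(y), n_folds=10):
    # refit on each training fold and score on the matching test fold
    score = LogisticRegression().fit(X[tr], y[tr]).score(X[te], y[te])
    if score > best_score:
        best_score, best_split = score, (tr, te)
# best_split now holds the train/test indices of the highest-scoring split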
whether training my classifier with this set is a good practice.
Absolutely not!
The only legitimate uses of cross validation are things like assessing overall performance, choosing between different models, or configuring model parameters.
Once you are committed to a model, you should train it on the entire training set. It is completely wrong to train it on the subset which happened to give the best score.
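A minimal sketch of that workflow, assuming X and y are the full training data and that the choice is between two hypothetical regularisation settings: the cross-validated scores only pick the model, and the winner is then refit on everything:

from sklearn.cross_validation import cross_val_score
from sklearn.linear_model import LogisticRegression

# compare candidate settings by mean cross-validated score (assessment only)
candidates = [LogisticRegression(C=0.1), LogisticRegression(C=1.0)]
best = max(candidates, key=lambda clf: cross_val_score(clf, X, y, cv=10).mean())
# once committed, fit the chosen model on the entire training set
best.fit(X, y)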
Upvotes: 7