Reputation: 464
I'm using Python 2.7 and scikit-learn to do some machine learning. I am using grid search to determine the optimal hyperparameters for my dataset and a random forest classifier, with leave-one-label-out cross validation and area under the ROC curve (AUROC) as the metric for evaluating each set of hyperparameters. My code runs, but I'm a bit confused by the output of clf.grid_scores_.

From what I understand, each set of hyperparameters should be evaluated on every fold: a model is trained on all other folds and scored on the left-out fold, giving one AUROC per fold. Grid search should then report the mean and standard deviation over all folds for each set of hyperparameters. Using .grid_scores_ we can then view the mean, standard deviation, and raw AUROC values for each set of hyperparameters.
My question is: why are the reported mean and stddev of the cross-validation scores not equal to simply taking the .mean() and .std() of the reported AUROC values across all the folds?
The Code:
from sklearn import cross_validation, grid_search
from sklearn.ensemble import RandomForestClassifier
lol = cross_validation.LeaveOneLabelOut(group_labels)
rf = RandomForestClassifier(random_state=42, n_jobs=96)
parameters = {'min_samples_leaf': [500, 1000],
              'n_estimators': [100],
              'criterion': ['entropy'],
              'max_features': ['sqrt']
              }
clf = grid_search.GridSearchCV(rf, parameters, scoring='roc_auc', cv=lol)
clf.fit(train_features, train_labels)
for params, mean_score, scores in clf.grid_scores_:
    print("%0.3f (+/-%0.3f) for %r" % (scores.mean(), scores.std(), params))
print
for g in clf.grid_scores_: print g
print
print clf.best_score_
print clf.best_estimator_
The Output:
0.603 (+/-0.108) for {'max_features': 'sqrt', 'n_estimators': 100, 'criterion': 'entropy', 'min_samples_leaf': 500}
0.601 (+/-0.108) for {'max_features': 'sqrt', 'n_estimators': 100, 'criterion': 'entropy', 'min_samples_leaf': 1000}
mean: 0.60004, std: 0.10774, params: {'max_features': 'sqrt', 'n_estimators': 100, 'criterion': 'entropy', 'min_samples_leaf': 500}
mean: 0.59705, std: 0.10821, params: {'max_features': 'sqrt', 'n_estimators': 100, 'criterion': 'entropy', 'min_samples_leaf': 1000}
0.600042993354
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
max_depth=None, max_features='sqrt', max_leaf_nodes=None,
min_samples_leaf=500, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=96,
oob_score=False, random_state=42, verbose=0, warm_start=False)
Why do I calculate the mean for the first classifier as 0.603 while grid search reports 0.60004 (and there is a similar disagreement for the second mean)? I feel like either I'm missing something important that will help me find the best set of hyperparameters, or there is a bug in sklearn.
Upvotes: 2
Views: 4725
Reputation: 5324
I too was perplexed at first, so I took a look at the source code. These two lines clarify how the cross-validation score is calculated:
this_score *= this_n_test_samples
n_test_samples += this_n_test_samples
When grid search calculates the mean, it is a weighted mean. Your LeaveOneLabelOut
CV is most likely not balanced, that is, there is a different number of samples for each label. To reproduce the mean validation score that grid search reports, you need to multiply each fold's score by the proportion of the total test samples contained in that fold, and then sum the results.
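A minimal sketch of that weighted mean, using made-up fold scores and fold sizes (replace them with the actual per-fold AUROCs and test-set sizes from your LeaveOneLabelOut folds):

import numpy as np

# Hypothetical per-fold AUROC scores and test-set sizes, one entry per
# left-out label; substitute your real values here.
fold_scores = np.array([0.55, 0.62, 0.68])
fold_sizes = np.array([200, 120, 50])

# Unweighted statistics -- what scores.mean() and scores.std() give you.
print fold_scores.mean(), fold_scores.std()

# Sample-weighted mean -- what grid search reports as the mean validation
# score: each fold's score is weighted by its share of the test samples.
weights = fold_sizes / float(fold_sizes.sum())
print (fold_scores * weights).sum()

If I recall correctly, this weighting is controlled by GridSearchCV's iid parameter, which defaults to True in this version of sklearn. Only the mean is weighted; the reported std (0.10774) seems to be the plain np.std of the raw fold scores, which is why it agrees with your 0.108.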
Upvotes: 3