Reputation: 360
I'm using scikit-learn's cross_validation and get, for example, a mean score of 0.82 (r2_scorer).
How can I tell whether I am over-fitting or under-fitting, using scikit-learn functions?
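Roughly, this is the kind of call I mean (a sketch; regressor, X, and y stand in for my actual estimator and data):

from sklearn import cross_validation

# placeholder estimator and data; the real ones differ
scores = cross_validation.cross_val_score(regressor, X, y, scoring='r2', cv=5)
print "Mean r2 score", scores.mean()  # e.g. 0.82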
Upvotes: 1
Views: 2463
Reputation: 40149
Unfortunately, I can confirm that there is no built-in tool to compare train and test scores in a CV setup: cross_val_score only reports test scores.
You can set up your own loop with the train_test_split function as in Ando's answer, but you can also use any other CV scheme.
import numpy as np
from sklearn.cross_validation import KFold
from sklearn.metrics import SCORERS

scorer = SCORERS['r2']
# in this (old) API, KFold takes the number of samples first
cv = KFold(len(y), n_folds=5)
train_scores, test_scores = [], []
for train, test in cv:
    # regressor is your estimator, X and y your data
    regressor.fit(X[train], y[train])
    train_scores.append(scorer(regressor, X[train], y[train]))
    test_scores.append(scorer(regressor, X[test], y[test]))

mean_train_score = np.mean(train_scores)
mean_test_score = np.mean(test_scores)
If you compute the mean train and test scores with cross validation you can then find out if you are:

- underfitting: both the train and test scores are low (far from the perfect score, which is 1.0 for r2), or
- overfitting: the train score is high while the test score is significantly lower.
Note: you can be both significantly underfitting and overfitting at the same time if your model is inadequate and your data is too noisy.
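As a concrete illustration, a minimal check on the two means computed above might look like this (the 0.2 gap and the 0.5 floor are illustrative assumptions, not library defaults):

# illustrative thresholds; what counts as "low" or "a large gap"
# depends on the problem and the metric
print "mean train r2: %.3f" % mean_train_score
print "mean test r2:  %.3f" % mean_test_score
if mean_train_score - mean_test_score > 0.2:
    print "large train/test gap: possible overfitting"
if mean_train_score < 0.5:
    print "low train score: possible underfitting"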
Upvotes: 7
Reputation: 1967
You should compare the scores your model gets on the training data and on the testing data. If the scores are close to equal, you are likely underfitting. If they are far apart, you are likely overfitting (unless using a method such as random forest).
To compute the scores for both the train and test data, you can use something along the following lines (assuming your data is in variables X and Y):
from sklearn import cross_validation
from sklearn import svm

# do five iterations
for i in range(5):
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(
        X, Y, test_size=0.4)
    # your predictor, a linear SVM in this example
    clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
    print "Test score", clf.score(X_test, y_test)
    print "Train score", clf.score(X_train, y_train)
Upvotes: 0