Reputation: 360
I'm using scikit-learn's cross_validation and get, for example, a mean score of 0.82 (r2_scorer).
How can I tell whether I am over-fitting or under-fitting, using scikit-learn functions?
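Roughly, this is the kind of call I mean (a sketch; regressor, X, and y stand in for my actual estimator and data):

from sklearn import cross_validation

# placeholder estimator and data; the real ones differ
scores = cross_validation.cross_val_score(regressor, X, y, scoring='r2', cv=5)
print "Mean r2 score", scores.mean()  # e.g. 0.82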
Upvotes: 1
Views: 2463
Reputation: 40149
Unfortunately, I can confirm that there is no built-in tool to compare train and test scores in a CV setup: cross_val_score only reports test scores.
You can set up your own loop with the train_test_split function as in Ando's answer, but you can also use any other CV scheme.
import numpy as np
from sklearn.cross_validation import KFold
from sklearn.metrics import SCORERS

scorer = SCORERS['r2']
# in this (old) API, KFold takes the number of samples first
cv = KFold(len(y), n_folds=5)
train_scores, test_scores = [], []
for train, test in cv:
    # regressor is your estimator, X and y your data
    regressor.fit(X[train], y[train])
    train_scores.append(scorer(regressor, X[train], y[train]))
    test_scores.append(scorer(regressor, X[test], y[test]))

mean_train_score = np.mean(train_scores)
mean_test_score = np.mean(test_scores)
If you compute the mean train and test scores with cross validation you can then find out if you are:

- underfitting: both the train and test scores are low (far from the perfect score, which is 1.0 for r2), or
- overfitting: the train score is high while the test score is significantly lower.
Note: you can be both significantly underfitting and overfitting at the same time if your model is inadequate and your data is too noisy.
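As a concrete illustration, a minimal check on the two means computed above might look like this (the 0.2 gap and the 0.5 floor are illustrative assumptions, not library defaults):

# illustrative thresholds; what counts as "low" or "a large gap"
# depends on the problem and the metric
print "mean train r2: %.3f" % mean_train_score
print "mean test r2:  %.3f" % mean_test_score
if mean_train_score - mean_test_score > 0.2:
    print "large train/test gap: possible overfitting"
if mean_train_score < 0.5:
    print "low train score: possible underfitting"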
Upvotes: 7
Reputation: 1967
You should compare the scores your model gets on the training data and on the testing data. If the scores are close to equal, you are likely underfitting. If they are far apart, you are likely overfitting (unless using a method such as random forest).
To compute the scores for both the train and test data, you can use something along the following lines (assuming your data is in variables X and Y):
from sklearn import cross_validation
from sklearn import svm

# do five iterations
for i in range(5):
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(
        X, Y, test_size=0.4)
    # your predictor, a linear SVM in this example
    clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
    print "Test score", clf.score(X_test, y_test)
    print "Train score", clf.score(X_train, y_train)
Upvotes: 0