Aniket Schneider

Reputation: 944

Unexpected cross-validation scores with scikit-learn LinearRegression

I am trying to learn to use scikit-learn for some basic statistical learning tasks. I thought I had successfully created a LinearRegression model fit to my data:

from sklearn import cross_validation, linear_model

# hold out 20% of the data as a test set
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X, y,
    test_size=0.2, random_state=0)

model = linear_model.LinearRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

Which yields:

0.797144744766

Then I wanted to do multiple similar 4:1 splits via automatic cross-validation:

model = linear_model.LinearRegression()
scores = cross_validation.cross_val_score(model, X, y, cv=5)
print(scores)

And I get output like this:

[ 0.04614495 -0.26160081 -3.11299397 -0.7326256  -1.04164369]

How can the cross-validation scores be so different from the score of the single random split? Both are supposed to use r2 scoring, and the results are unchanged if I pass scoring='r2' explicitly to cross_val_score.

I've tried a number of different values for the random_state parameter to cross_validation.train_test_split, and they all give similar scores, in the 0.7 to 0.9 range.
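Concretely, that check was along these lines (a sketch; the seed values are arbitrary):

for seed in range(5):  # arbitrary seeds, just to vary the split
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model = linear_model.LinearRegression().fit(X_train, y_train)
    print(model.score(X_test, y_test))  # each lands roughly in 0.7-0.9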

I am using sklearn version 0.16.1.

Upvotes: 3

Views: 2232

Answers (3)

leonkato

Reputation: 196

Folks, thanks for this thread.

The code in Schneider's answer is outdated (the cross_validation module was deprecated in favor of model_selection). As of scikit-learn==0.19.1, the following works as expected:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

regressor = LinearRegression()
kf = KFold(n_splits=3, shuffle=True, random_state=0)
cv_scores = cross_val_score(regressor, X, y, cv=kf)

Best,

M.

Upvotes: 0

Felix Darvas

Reputation: 507

train_test_split generates random splits of the dataset, while cross_val_score with an integer cv uses consecutive folds:

"When the cv argument is an integer, cross_val_score uses the KFold or StratifiedKFold strategies by default"

http://scikit-learn.org/stable/modules/cross_validation.html

Depending on the nature of your data set, e.g. data that is highly correlated within each contiguous segment, consecutive folds can give vastly different fits than random samples drawn from the whole data set.
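To see the difference concretely, here is a small sketch using the newer sklearn.model_selection API (the toy X is made up), printing the test indices each strategy produces:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # toy data, ordered like the OP's blocks

# Default KFold: each test fold is one consecutive chunk of the data
for _, test_idx in KFold(n_splits=5).split(X):
    print(test_idx)  # [0 1], then [2 3], [4 5], [6 7], [8 9]

# shuffle=True: test samples are drawn from across the whole dataset
for _, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    print(test_idx)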

Upvotes: 3

Aniket Schneider

Reputation: 944

It turns out that my data was ordered in blocks of different classes, and by default cross_validation.cross_val_score picks consecutive splits rather than random (shuffled) splits. I was able to solve this by specifying that the cross-validation should use shuffled splits:

from sklearn import cross_validation, linear_model

model = linear_model.LinearRegression()
# shuffle=True makes each fold draw samples from across the whole dataset
shuffle = cross_validation.KFold(len(X), n_folds=5, shuffle=True, random_state=0)
scores = cross_validation.cross_val_score(model, X, y, cv=shuffle)
print(scores)

Which gives:

[ 0.79714474  0.86636341  0.79665689  0.8036737   0.6874571 ]

This is in line with what I would expect.
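For a single summary figure, the fold scores can be reduced to a mean and spread in the usual way:

print("R^2: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))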

Upvotes: 4
