Reputation: 944
I am trying to learn to use scikit-learn for some basic statistical learning tasks. I thought I had successfully created a LinearRegression model fit to my data:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
X, y,
test_size=0.2, random_state=0)
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
print model.score(X_test, y_test)
Which yields:
0.797144744766
Then I wanted to do multiple similar 4:1 splits via automatic cross-validation:
model = linear_model.LinearRegression()
scores = cross_validation.cross_val_score(model, X, y, cv=5)
print scores
And I get output like this:
[ 0.04614495 -0.26160081 -3.11299397 -0.7326256 -1.04164369]
How can the cross-validation scores be so different from the score of the single random split? They are both supposed to be using r2 scoring, and the results are the same if I pass the scoring='r2'
parameter to cross_val_score
.
I've tried a number of different options for the random_state
parameter to cross_validation.train_test_split
, and they all give similar scores in the 0.7 to 0.9 range.
I am using sklearn version 0.16.1
Upvotes: 3
Views: 2232
Reputation: 196
Folks, thanks for this thread.
The code in the answer above (Schneider) is outdated.
As of scikit-learn==0.19.1, this will work as expected.
from sklearn.model_selection import cross_val_score, KFold
kf = KFold(n_splits=3, shuffle=True, random_state=0)
cv_scores = cross_val_score(regressor, X, y, cv=kf)
Best,
M.
Upvotes: 0
Reputation: 507
train_test_split seems to generate random splits of the dataset, while cross_val_score uses consecutive sets, i.e.
"When the cv argument is an integer, cross_val_score uses the KFold or StratifiedKFold strategies by default"
http://scikit-learn.org/stable/modules/cross_validation.html
Depending on the nature of your data set, e.g. data highly correlated over the length of one segment, consecutive sets will give vastly different fits than e.g. random samples from the whole data set.
Upvotes: 3
Reputation: 944
It turns out that my data was ordered in blocks of different classes, and by default cross_validation.cross_val_score
picks consecutive splits rather than random (shuffled) splits. I was able to solve this by specifying that the cross-validation should use shuffled splits:
model = linear_model.LinearRegression()
shuffle = cross_validation.KFold(len(X), n_folds=5, shuffle=True, random_state=0)
scores = cross_validation.cross_val_score(model, X, y, cv=shuffle)
print scores
Which gives:
[ 0.79714474 0.86636341 0.79665689 0.8036737 0.6874571 ]
This is in line with what I would expect.
Upvotes: 4