Reputation: 3854
I'm working on a text classification problem, which I've set up like so (I've left out the data processing steps for concision, but they'll produce a dataframe called data with columns X and y):
import sklearn.model_selection as ms
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

sim = Pipeline([("vec", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
                ("rdf", RandomForestClassifier())])
Now I try to validate this model by training it on 2/3 of the data and scoring it on the remaining 1/3, like so:
train, test = ms.train_test_split(data, test_size = 0.33)
sim.fit(train.X, train.y)
sim.score(test.X, test.y)
# 0.533333333333
I want to do this three times for three different test sets, but using cross_val_score gives me results that are much lower.
ms.cross_val_score(sim, data.X, data.y)
# [ 0.29264069 0.36729223 0.22977941]
As far as I know, each of the scores in that array should be produced by training on 2/3 of the data and scoring on the remaining 1/3 with the sim.score method. So why are they all so much lower?
Upvotes: 18
Views: 5886
Reputation: 3854
I solved this problem in the process of writing my question, so here it goes:
The default behavior of cross_val_score is to use KFold or StratifiedKFold to define the folds. Both have shuffle=False by default, so the folds are not drawn randomly from the data:
import numpy as np
import sklearn.model_selection as ms
for i, j in ms.KFold(n_splits=3).split(np.arange(9)):
print("TRAIN:", i, "TEST:", j)
TRAIN: [3 4 5 6 7 8] TEST: [0 1 2]
TRAIN: [0 1 2 6 7 8] TEST: [3 4 5]
TRAIN: [0 1 2 3 4 5] TEST: [6 7 8]
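For contrast, passing shuffle=True draws the fold indices randomly instead of in order (the fixed random_state here is my addition, purely for reproducibility):

```python
import numpy as np
import sklearn.model_selection as ms

# With shuffle=True the indices are permuted before being split into folds,
# so each fold mixes rows from across the whole dataset
for i, j in ms.KFold(n_splits=3, shuffle=True, random_state=0).split(np.arange(9)):
    print("TRAIN:", i, "TEST:", j)
```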
My raw data was arranged by label, so with this default behavior I was trying to predict a lot of labels I hadn't seen in the training data. This is even more pronounced if I force use of KFold (since I was doing classification, StratifiedKFold was the default):
ms.cross_val_score(sim, data.X, data.y, cv = ms.KFold(n_splits = 3))
# array([ 0.05530776, 0.05709188, 0.025 ])
ms.cross_val_score(sim, data.X, data.y, cv = ms.StratifiedKFold(n_splits = 3, shuffle = False))
# array([ 0.2978355 , 0.35924933, 0.27205882])
ms.cross_val_score(sim, data.X, data.y, cv = ms.KFold(n_splits = 3, shuffle = True))
# array([ 0.51561106, 0.50579839, 0.51785714])
ms.cross_val_score(sim, data.X, data.y, cv = ms.StratifiedKFold(n_splits = 3, shuffle = True))
# array([ 0.52869565, 0.54423592, 0.55626715])
Doing things by hand was giving me higher scores because train_test_split was doing the same thing as KFold(shuffle = True).
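An even closer analogue of repeating train_test_split three times is ShuffleSplit, which draws a fresh random partition on every iteration (sketched here on a dummy index array; the random_state is my addition):

```python
import numpy as np
import sklearn.model_selection as ms

# ShuffleSplit draws an independent random train/test partition each time,
# which is what calling train_test_split in a loop would do
splitter = ms.ShuffleSplit(n_splits=3, test_size=3, random_state=0)
for train_idx, test_idx in splitter.split(np.arange(9)):
    print("TRAIN:", np.sort(train_idx), "TEST:", np.sort(test_idx))
```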
Upvotes: 18