Antoine Deleuze

Reputation: 37

GridSearchCV issue with sklearn

I am currently working on a text classifier and using GridSearchCV from sklearn to find the best hyper-parameters for my classifiers. However, there is something I don't understand in the best_score_ returned by the grid search:

import pickle

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Load the pre-cleaned features and targets
with open('cleaned_data.pkl', 'rb') as f:
    X = pickle.load(f)

with open('cleaned_targets.pkl', 'rb') as f:
    Y = pickle.load(f)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.01, random_state=1, stratify=Y)

# Hyper-parameter grids for the SVC
test_param_gamma = [i for i in np.arange(0.1, 0.6, 0.1)]
test_param_C = [i for i in np.arange(4, 4.5, 0.1)]

count_vect = CountVectorizer(stop_words='english')
tfidf_transformer = TfidfTransformer()

parameters = {'clf2__gamma': test_param_gamma, 'clf2__C': test_param_C}
nb = Pipeline([('cv', count_vect), ('tfidf', tfidf_transformer), ('clf2', SVC())])
gs_clf2 = GridSearchCV(nb, parameters, verbose=10)
gs_clf2 = gs_clf2.fit(X_train, Y_train)
print(gs_clf2.best_score_)
print(gs_clf2.best_params_)

If I fit my grid search on X_train and Y_train, which are only slightly smaller data sets than X and Y (since I used a test_size of 0.01), I get a best_score_ that is about 10 points higher than when I fit it on the entire data sets, that is to say:

gs_clf2 = gs_clf2.fit(X, Y)

My questions are:

  1. Why is my classifier better with a smaller data set?
  2. Why is there such a big difference in performance for a data set that only has something like 20 more samples?

NB: I observe the same behaviour with Naive Bayes classifiers... I have tried several values for test_size and it does not seem to have a significant impact on best_score_; there is something I don't understand.

Thank you in advance!

Antoine

Upvotes: 0

Views: 424

Answers (1)

Antoine Deleuze

Reputation: 37

OK, I just found the answer by looking at my variables, and it may be useful to others:

train_test_split from sklearn allows you to split your data set into two data sets, one for training and one for testing.

Nevertheless, it also shuffles the data, so your train/test target sets will not look like [1 1 1 1 1 3 3 3 3 3 3 3 2 2 2 ...] (my data were sorted this way) but rather like [1 1 3 2 2 3 1 3 2 1 ...].

So when GridSearchCV performs cross-validation, there is class diversity in the folds it uses. If you don't shuffle your data, the folds will most likely look like [1 1 1 1 1 3 3], [3 3 3 3 2 2], ... etc.

Trick to shuffle two lists at once while keeping them aligned:

from random import shuffle

# Pair up features and targets so they stay aligned, shuffle the pairs,
# then unzip them back into X and Y.
c = list(zip(X, Y))
shuffle(c)
X, Y = zip(*c)
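
For reference, here is a minimal sketch of two sklearn-native alternatives, assuming X, Y, nb and parameters are the objects defined in the question: sklearn.utils.shuffle keeps the arrays aligned, and passing an explicit StratifiedKFold lets GridSearchCV shuffle inside the folds without touching the data.

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.utils import shuffle as sk_shuffle

# Option 1: shuffle features and targets together, keeping them aligned
# (X and Y are assumed to be the lists loaded from the pickle files).
X, Y = sk_shuffle(X, Y, random_state=1)

# Option 2: leave the data as-is and hand GridSearchCV a splitter that
# shuffles before building the folds (nb and parameters come from the question).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
gs_clf2 = GridSearchCV(nb, parameters, cv=cv, verbose=10)
gs_clf2.fit(X, Y)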

Upvotes: 0
