Reputation: 1468
I was trying to learn how GridSearchCV works by testing it on KNeighborsClassifier. When I set n_neighbors=9, my classifier gave a score of 0.9122807017543859,
but when I used GridSearchCV with 9 included in the n_neighbors list, I got a score of 0.8947368421052632.
What could be the reason? Any help is appreciated. Here's my code:
from sklearn import datasets
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split as splitter
import pickle
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
# Data pre-processing <-----------------------
data = datasets.load_breast_cancer()
p = data
add = data.target.reshape(569, 1)
columns = np.append(data.feature_names,
data.target_names[0],
axis=None)
data = np.append(data.data,
add,
axis=1)
df = pd.DataFrame(data=data,columns=columns)
X_train,X_test,y_train,y_test = splitter(p.data,
p.target,
test_size=0.3,
random_state=12)
gauss = KNeighborsClassifier(n_neighbors=9)
param_grid = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]}
gausCV = GridSearchCV(KNeighborsClassifier(), param_grid, verbose=False)
gauss.fit(X_train,y_train)
gausCV.fit(X_train,y_train)
print(gauss.score(X_test, y_test))
print(gausCV.score(X_test, y_test))
This is what I got:
0.9122807017543859
0.8947368421052632
Upvotes: 3
Views: 1613
Reputation: 2069
The issue is not the number of neighbors but the cross-validation itself. GridSearchCV
not only tries every value in your param_grid
, it also evaluates each candidate on several "folds" of the training data: the data is resampled multiple times so that the final classifier is as robust to new data as possible. Given how close the scores of the gauss
and gausCV
models are, it is almost certain that the particular data drawn into each fold is affecting the results, though not heavily.
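You can see what GridSearchCV actually selected, and how the cross-validation scores vary across candidates, by inspecting its best_params_, best_score_, and cv_results_ attributes after fitting. Here is a sketch using the same breast-cancer split as in the question (the sorted range 1-13 stands in for the original list):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

data = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=12)

param_grid = {'n_neighbors': list(range(1, 14))}
gausCV = GridSearchCV(KNeighborsClassifier(), param_grid)
gausCV.fit(X_train, y_train)

# The value GridSearchCV picked may not be 9: it keeps whichever
# candidate scored best on the cross-validation folds, and
# gausCV.score() then uses that refit estimator on the test set.
print(gausCV.best_params_)
print(gausCV.best_score_)  # mean CV accuracy on the training folds

# Mean and standard deviation across folds for every candidate k:
# the std column shows how much the folds disagree.
for k, mean, std in zip(param_grid['n_neighbors'],
                        gausCV.cv_results_['mean_test_score'],
                        gausCV.cv_results_['std_test_score']):
    print(k, round(mean, 4), '+/-', round(std, 4))
```

If best_params_ turns out to be something other than {'n_neighbors': 9}, that alone explains why gausCV.score differs from the plain k=9 classifier.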
This is a good example of why simply accepting the model with the highest "score" might not always be the best path: all else being equal, I would have more faith in a model that scored well after going through cross-validation than in one that had not.
Here is a good description of what is going on when you run cross-validation.
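To see the fold effect directly, you can run cross_val_score with a fixed n_neighbors=9 on the same training data; the per-fold accuracies typically spread by a few percentage points, which is the same order of magnitude as the gap between your two numbers (a sketch, same split as in the question):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

data = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=12)

# Score k=9 on each of 5 folds of the training data.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=9),
                         X_train, y_train, cv=5)
print(scores)                        # one accuracy per fold
print(scores.mean(), scores.std())   # the spread across folds
```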
Upvotes: 2