Reputation: 1468
I was trying to learn how GridSearchCV works by testing it on KNeighborsClassifier. When I set n_neighbors=9, my classifier gave a score of 0.9122807017543859,
but when I used GridSearchCV with 9 included in the n_neighbors list, I got a score of 0.8947368421052632.
What could be the reason? Any help is appreciated. Here's my code:
from sklearn import datasets
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split as splitter
import pickle
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
# Data pre-processing <-----------------------
data = datasets.load_breast_cancer()
p = data
add = data.target.reshape(569, 1)
columns = np.append(data.feature_names,
data.target_names[0],
axis=None)
data = np.append(data.data,
add,
axis=1)
df = pd.DataFrame(data=data,columns=columns)
X_train,X_test,y_train,y_test = splitter(p.data,
p.target,
test_size=0.3,
random_state=12)
gauss = KNeighborsClassifier(n_neighbors=9)
param_grid = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]}
gausCV = GridSearchCV(KNeighborsClassifier(), param_grid, verbose=False)
gauss.fit(X_train,y_train)
gausCV.fit(X_train,y_train)
print(gauss.score(X_test, y_test))
print(gausCV.score(X_test, y_test))
This is what I got:
0.9122807017543859
0.8947368421052632
Upvotes: 3
Views: 1613
Reputation: 2069
The issue is not the number of neighbors but the cross-validation itself. GridSearchCV
not only tries every value in your param_grid
, it also evaluates each candidate on several "folds" of the training data: the data is resampled multiple times so that the final classifier is as robust to new data as possible. Given how close the scores of the gauss
and gausCV
models are, it is almost certain that the particular data drawn into each fold is affecting the results, though not heavily.
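You can see what GridSearchCV actually selected, and how the cross-validation scores vary across candidates, by inspecting its best_params_, best_score_, and cv_results_ attributes after fitting. Here is a sketch using the same breast-cancer split as in the question (the sorted range 1-13 stands in for the original list):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

data = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=12)

param_grid = {'n_neighbors': list(range(1, 14))}
gausCV = GridSearchCV(KNeighborsClassifier(), param_grid)
gausCV.fit(X_train, y_train)

# The value GridSearchCV picked may not be 9: it keeps whichever
# candidate scored best on the cross-validation folds, and
# gausCV.score() then uses that refit estimator on the test set.
print(gausCV.best_params_)
print(gausCV.best_score_)  # mean CV accuracy on the training folds

# Mean and standard deviation across folds for every candidate k:
# the std column shows how much the folds disagree.
for k, mean, std in zip(param_grid['n_neighbors'],
                        gausCV.cv_results_['mean_test_score'],
                        gausCV.cv_results_['std_test_score']):
    print(k, round(mean, 4), '+/-', round(std, 4))
```

If best_params_ turns out to be something other than {'n_neighbors': 9}, that alone explains why gausCV.score differs from the plain k=9 classifier.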
This is a good example of why simply accepting the model with the highest "score" might not always be the best path: all else being equal, I would have more faith in a model that scored well after going through cross-validation than in one that had not.
Here is a good description of what is going on when you run cross-validation.
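To see the fold effect directly, you can run cross_val_score with a fixed n_neighbors=9 on the same training data; the per-fold accuracies typically spread by a few percentage points, which is the same order of magnitude as the gap between your two numbers (a sketch, same split as in the question):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

data = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=12)

# Score k=9 on each of 5 folds of the training data.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=9),
                         X_train, y_train, cv=5)
print(scores)                        # one accuracy per fold
print(scores.mean(), scores.std())   # the spread across folds
```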
Upvotes: 2