Reputation: 1
I'm learning machine learning and I'm facing a mismatch I can't explain.
I'm running a grid search to find the best model, according to the accuracy returned by GridSearchCV.
import sklearn.metrics
import sklearn.model_selection
import sklearn.neighbors

model = sklearn.neighbors.KNeighborsClassifier()

# Hyperparameter grid for the KNN classifier
n_neighbors = [3, 4, 5, 6, 7, 8, 9]
weights = ['uniform', 'distance']
algorithm = ['auto', 'ball_tree', 'kd_tree', 'brute']
leaf_size = [20, 30, 40, 50]
p = [1]
param_grid = dict(n_neighbors=n_neighbors, weights=weights, algorithm=algorithm, leaf_size=leaf_size, p=p)

grid = sklearn.model_selection.GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=1)
SGDgrid = grid.fit(data1, targetd_simp['VALUES'])

print("SGD Classifier: ")
print("Best: ")
print(SGDgrid.best_score_)
value = SGDgrid.best_score_
print("params:")
print(SGDgrid.best_params_)
print("Best estimator:")
print(SGDgrid.best_estimator_)

# Evaluate the best estimator on the same data used for the search
y_pred_train = SGDgrid.best_estimator_.predict(data1)
print(sklearn.metrics.confusion_matrix(targetd_simp['VALUES'], y_pred_train))
print(sklearn.metrics.accuracy_score(targetd_simp['VALUES'], y_pred_train))
The results I get are the following:
SGD Classifier:
Best:
0.38694539229180525
params:
{'algorithm': 'auto', 'leaf_size': 20, 'n_neighbors': 8, 'p': 1, 'weights': 'distance'}
Best estimator:
KNeighborsClassifier(leaf_size=20, n_neighbors=8, p=1, weights='distance')
[[4962 0 0]
[ 0 4802 0]
[ 0 0 4853]]
1.0
This model is probably highly overfitted. I still need to check that, but it's not the point of the question here.
So, basically, if I understand correctly, GridSearchCV is finding a best accuracy score of 0.3869 (quite poor) for one of the chunks in the cross-validation, but the final confusion matrix is perfect, and so is the accuracy computed from it. It doesn't make much sense to me... How can a model that is, in theory, this bad perform so well?
I also added scoring='accuracy' in GridSearchCV to make sure that the returned value is actually the accuracy, and it returns exactly the same value.
What am I missing here?
Upvotes: 0
Views: 846
Reputation: 5164
The behavior you are describing is quite normal and to be expected. You should know that GridSearchCV has a parameter refit, which is set to True by default. It triggers the following:
Refit an estimator using the best found parameters on the whole dataset.
This means that the estimator returned by best_estimator_ has been refit on your whole dataset (data1 in your case). The estimator is therefore predicting on data it has already seen during training and, as expected, performs especially well on it. You can easily reproduce this with the following example:
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

search = GridSearchCV(KNeighborsClassifier(), param_grid={'n_neighbors': [3, 4, 5]})
search.fit(X_train, y_train)
print(search.best_score_)
>>> 0.8533333333333333
print(accuracy_score(y_train, search.predict(X_train)))
>>> 0.9066666666666666
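As a side note, best_score_ is not the score of a single fold: it is the mean validation accuracy over all folds for the best parameter combination. You can verify this from cv_results_ (a small sketch, reusing the search object from above):
import numpy as np

# Per-fold validation scores of the best candidate (default cv=5)
best = search.best_index_
fold_scores = [search.cv_results_[f'split{i}_test_score'][best] for i in range(5)]
print(fold_scores)
print(np.mean(fold_scores))  # identical to search.best_score_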
While this is not as impressive as in your case, it is still a clear result. During cross-validation, the model is validated against a fold that was not used for training it, and thus against data it has not seen before. In the second case, however, the model has already seen all the data during training, and it is to be expected that it performs better on it.
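To make this concrete, you can score a fresh, unfitted copy of the best estimator with cross-validation and compare it against its score on its own training data; a minimal sketch, again reusing the example above:
from sklearn.base import clone
from sklearn.model_selection import cross_val_score

# Each fold is predicted by a model trained on the remaining folds only
knn = clone(search.best_estimator_)
print(cross_val_score(knn, X_train, y_train, cv=5).mean())  # close to search.best_score_

# The same model scored on its own training data is systematically more optimistic
knn.fit(X_train, y_train)
print(knn.score(X_train, y_train))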
To get a better sense of the true model performance, you should evaluate on a holdout set with data the model has not seen before:
print(accuracy_score(y_test, search.predict(X_test)))
>>> 0.76
As you can see, the model performs considerably worse on this data, which shows that the former metrics were all a bit too optimistic. The model did in fact not generalize that well.
In conclusion, your result is not surprising and has a simple explanation. The large discrepancy between the scores is striking, but it follows the same logic and is actually just a clear indicator of overfitting.
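Finally, if you do not want GridSearchCV to refit on the whole training set, the refit step can be disabled; best_params_ and best_score_ then remain available, but best_estimator_ does not. A minimal sketch:
# With refit=False, no estimator is fit on the full data,
# so you have to train the final model yourself
search_no_refit = GridSearchCV(KNeighborsClassifier(), param_grid={'n_neighbors': [3, 4, 5]}, refit=False)
search_no_refit.fit(X_train, y_train)

final_model = KNeighborsClassifier(**search_no_refit.best_params_)
final_model.fit(X_train, y_train)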
Upvotes: 2