Dan

Reputation: 45741

Understanding sklearn GridSearchCV's best_score_ and best_estimator_

In the code below, I am trying to understand the connection between best_estimator_ and best_score_. I think I should be able to get at least a very close approximation of best_score_ by scoring the predictions of best_estimator_, like so:

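import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# solver='liblinear' is set explicitly because the current default solver (lbfgs) does not support the l1 penalty
classifier = GridSearchCV(LogisticRegression(penalty='l1', solver='liblinear'),
                          {'C': 10**(np.linspace(1, 6, num=11))},
                          scoring='neg_log_loss')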

classifier.fit(X_train, y_train)

y_pred = classifier.best_estimator_.predict(X_train)
print(f'{log_loss(y_train,y_pred)}') 
print(f'{classifier.best_score_}')

However, I get the following output (the numbers do not vary much between runs):

7.841241697018637
-0.5470694752031108

I understand that best_score_ is calculated as an average over the cross-validation folds, but surely this should be a close approximation (an unbiased estimator, even?) of the metric calculated on the whole set. I don't understand why the two values are so different, so I assume that I've made an implementation error.

How can I calculate classifier.best_score_ myself?

Upvotes: 0

Views: 3295

Answers (1)

Vivek Kumar

Reputation: 36599

log_loss is defined on predicted probabilities, i.e. the output of predict_proba(), not on the hard class labels returned by predict(). I am assuming that GridSearchCV internally calls predict_proba() and then calculates the score.
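To see why hard labels inflate the number so much, here is a minimal sketch on made-up data (the arrays below are hypothetical, not taken from the question). It assumes that log_loss clips predictions away from exactly 0 and 1 before taking the logarithm, so a single wrong hard label costs roughly 35 for that sample, while a mildly wrong probability costs less than 1.

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy binary problem (hypothetical data): four samples with known labels.
y_true = np.array([0, 1, 1, 0])

# Probability of class 1 for each sample, like one column of predict_proba().
# The third sample is on the wrong side of 0.5, but only mildly.
proba_pos = np.array([0.2, 0.7, 0.4, 0.1])

# Hard labels from thresholding at 0.5, like the output of predict().
hard_labels = (proba_pos >= 0.5).astype(float)  # 0., 1., 0., 0.

# Scoring the probabilities: the one mistake is penalised in proportion
# to how wrong the probability was.
loss_from_proba = log_loss(y_true, proba_pos)

# Scoring the hard labels: the one mistake is treated as a fully confident
# wrong prediction and dominates the average.
loss_from_labels = log_loss(y_true, hard_labels)

print(loss_from_proba, loss_from_labels)
```

With hard labels the result is essentially (fraction of misclassified samples) times a large constant, which is consistent with the value around 7.8 reported in the question.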

Please change the predict() to predict_proba() and you will see similar results.

y_pred = classifier.best_estimator_.predict_proba(X_train)

print(log_loss(y_train,y_pred)) 
print(classifier.best_score_)

On the iris dataset, I get the following output:

0.165794760809
-0.185370083771

which are quite close in magnitude.

Update:

It looks like this is indeed the case. When you supply 'neg_log_loss' as a string to GridSearchCV, this is how the scorer is initialized before being passed on to the internal _fit_and_score() helper that GridSearchCV calls:

log_loss_scorer = make_scorer(log_loss, greater_is_better=False,
                              needs_proba=True)

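As you can see, needs_proba is True, which means that predict_proba() is used for scoring. greater_is_better=False makes the scorer negate the loss, which is why best_score_ comes out negative.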
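To answer the "how can I calculate best_score_ myself" part directly, here is a rough sketch on the iris dataset. It is not the asker's exact setup: the default l2-penalised LogisticRegression, the small C grid and the explicit StratifiedKFold object are assumptions chosen so that the run is deterministic. The idea is that best_score_ is the mean test-fold score of the winning candidate, so it can be read from cv_results_ or recomputed by cross-validating the best parameters on the same folds.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Unshuffled folds, shared by the search and the manual check,
# so both evaluate on exactly the same splits.
cv = StratifiedKFold(n_splits=5)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),   # assumed estimator, not the asker's
    {"C": [0.1, 1.0, 10.0]},             # assumed grid, for illustration only
    scoring="neg_log_loss",
    cv=cv,
)
search.fit(X, y)

# 1) Read the mean test-fold score of the winning candidate from cv_results_.
from_cv_results = search.cv_results_["mean_test_score"][search.best_index_]

# 2) Recompute it: cross-validate a fresh model with the best parameters
#    on the same folds and average the per-fold scores.
manual_scores = cross_val_score(
    LogisticRegression(max_iter=1000, **search.best_params_),
    X, y, scoring="neg_log_loss", cv=cv,
)
manual_mean = manual_scores.mean()

# 3) For comparison: the in-sample loss of the refit best_estimator_.
#    This is a different quantity (positive sign, scored on training data).
in_sample_loss = log_loss(y, search.best_estimator_.predict_proba(X))

print(search.best_score_, from_cv_results, manual_mean, in_sample_loss)
```

The first three printed values should agree, while the fourth is only expected to be in the same ballpark: it is scored on data the model was trained on, so it is not an unbiased estimate of the cross-validated score.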

Upvotes: 1
