pmdaly

Reputation: 1222

Why can't I get the same results as GridSearchCV?

GridSearchCV only returns a score for each parametrization, and I would like to see an ROC curve as well to better understand the results. To do this, I would like to take the best-performing model from GridSearchCV and reproduce the same results, but also cache the probabilities. Here is my code:

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from tqdm import tqdm

import warnings
warnings.simplefilter("ignore")

data = make_classification(n_samples=100, n_features=20, n_classes=2, 
                           random_state=1, class_sep=0.1)
X, y = data


small_pipe = Pipeline([
    ('rfs', SelectFromModel(RandomForestClassifier(n_estimators=100))), 
    ('clf', LogisticRegression())
])

params = {
    'clf__class_weight': ['balanced'],
    'clf__penalty'     : ['l1', 'l2'],
    'clf__C'           : [0.1, 0.5, 1.0],
    'rfs__max_features': [3, 5, 10]
}
key_feats = ['mean_train_score', 'mean_test_score', 'param_clf__C', 
             'param_clf__penalty', 'param_rfs__max_features']

skf = StratifiedKFold(n_splits=5, random_state=0)

all_results = list()
for _ in tqdm(range(25)):
    gs = GridSearchCV(small_pipe, param_grid=params, scoring='roc_auc', cv=skf, n_jobs=-1)
    gs.fit(X, y)
    results = pd.DataFrame(gs.cv_results_)[key_feats]
    all_results.append(results)


param_group = ['param_clf__C', 'param_clf__penalty', 'param_rfs__max_features']
all_results_df = pd.concat(all_results)
all_results_df.groupby(param_group).agg(['mean', 'std']
                    ).sort_values(('mean_test_score', 'mean'), ascending=False).head(20)

Here is my attempt at reproducing the results

small_pipe_w_params = Pipeline([
    ('rfs', SelectFromModel(RandomForestClassifier(n_estimators=100), max_features=3)), 
    ('clf', LogisticRegression(class_weight='balanced', penalty='l2', C=0.1))
])
skf = StratifiedKFold(n_splits=5, random_state=0)
all_scores = list()
for _ in range(25):
    scores = list()
    for train, test in skf.split(X, y):
        small_pipe_w_params.fit(X[train, :], y[train])
        probas = small_pipe_w_params.predict_proba(X[test, :])[:, 1]
        # cache probas here to build an ROC curve with a confidence interval later
        scores.append(roc_auc_score(y[test], probas))
    all_scores.extend(scores)

print('mean: {:<1.3f}, std: {:<1.3f}'.format(np.mean(all_scores), np.std(all_scores)))
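
(Here is roughly what I plan to do with the cached probabilities later; a sketch assuming matplotlib, with plot_cv_roc being a placeholder helper rather than finished code. Each cached item would be the (y[test], probas) pair from one fold.)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

def plot_cv_roc(cached):
    """cached: list of (y_true, probas) pairs, one per held-out fold."""
    aucs = []
    for y_true, probas in cached:
        fpr, tpr, _ = roc_curve(y_true, probas)
        plt.plot(fpr, tpr, alpha=0.3, lw=1)
        aucs.append(roc_auc_score(y_true, probas))
    plt.plot([0, 1], [0, 1], linestyle='--', color='grey')
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title('ROC per fold, AUC = {:.3f} +/- {:.3f}'.format(np.mean(aucs), np.std(aucs)))
    plt.show()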

I'm running both the grid search and my reproduction loop multiple times because the results seem unstable. I created a deliberately challenging dataset, since my own dataset is just as hard to learn. The groupby is meant to average the train and test scores (and take their std) across all iterations of GridSearchCV to stabilize the results. I then pick out the best-performing model (C=0.1, penalty=l2 and max_features=3 in my most recent run) and try to reproduce these same results by setting those parameters explicitly.

The GridSearchCV model yields a mean ROC AUC of 0.63 with std 0.042, whereas my own implementation gets a mean of 0.59 with std 0.131. The grid search scores are considerably better. If I run this experiment for 100 iterations of both GridSearchCV and my own loop, the results are similar.

Why are these results not the same? Both approaches use StratifiedKFold (GridSearchCV uses it internally when an integer is supplied for cv)... and maybe GridSearchCV weights the scores by the size of each fold? I'm not sure about that, but it would make sense. Is my implementation flawed?

edit: random_state added to SKFold

Upvotes: 3

Views: 2279

Answers (1)

Venkatachalam

Reputation: 16966

If you set the random_state of the RandomForestClassifier, the variation between different GridSearchCV runs is eliminated.
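
For example, here is a sketch of a fully deterministic version of the question's grid search (my simplified variant: I added solver='liblinear' so that both penalties work on recent scikit-learn versions, and return_train_score=True so that mean_train_score is available):

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=100, n_features=20, n_classes=2,
                           random_state=1, class_sep=0.1)

# random_state=0 on the forest pins the feature-selection step, so repeated
# GridSearchCV runs select the same features and produce identical cv_results_.
deterministic_pipe = Pipeline([
    ('rfs', SelectFromModel(RandomForestClassifier(n_estimators=10, random_state=0))),
    ('clf', LogisticRegression(class_weight='balanced', solver='liblinear',
                               random_state=0))
])

params = {
    'clf__penalty': ['l1', 'l2'],
    'clf__C': [0.1, 0.5, 1.0],
    'rfs__max_features': [3, 5, 10]
}

# shuffle is off by default, so the stratified splits are deterministic as well
skf = StratifiedKFold(n_splits=5)

gs = GridSearchCV(deterministic_pipe, param_grid=params, scoring='roc_auc',
                  cv=skf, return_train_score=True, n_jobs=-1)
gs.fit(X, y)
print(pd.DataFrame(gs.cv_results_)[['mean_train_score', 'mean_test_score']].head())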

For simplification, I have set n_estimators=10 and got the following result:

                                                        mean_train_score       mean_test_score
                                                            mean       std         mean    std
param_clf__C  param_clf__penalty  param_rfs__max_features
1.0           l2                  5                     0.766701  0.000000     0.580727    0.0
                                  10                    0.768849  0.000000     0.577737    0.0

Now, if we look at the performance on each split (by removing the key_feats filtering) for the best hyperparameters, using

all_results_df.sort_values(('mean_test_score'), ascending=False).head(1).T

we will get

    16
mean_fit_time   0.228381
mean_score_time 0.113187
mean_test_score 0.580727
mean_train_score    0.766701
param_clf__C    1
param_clf__class_weight balanced
param_clf__penalty  l2
param_rfs__max_features 5
params  {'clf__class_weight': 'balanced', 'clf__penalt...
rank_test_score 1
split0_test_score   0.427273
split0_train_score  0.807051
split1_test_score   0.47
split1_train_score  0.791745
split2_test_score   0.54
split2_train_score  0.789243
split3_test_score   0.78
split3_train_score  0.769856
split4_test_score   0.7
split4_train_score  0.67561
std_fit_time    0.00586908
std_score_time  0.00152781
std_test_score  0.13555
std_train_score 0.0470554

Let us reproduce this!

skf = StratifiedKFold(n_splits=5, random_state=0)

scores = []
weights = []


for train, test in skf.split(X, y):
    small_pipe_w_params = Pipeline([
                ('rfs', SelectFromModel(RandomForestClassifier(n_estimators=10, 
                                                               random_state=0),max_features=5)), 
                ('clf', LogisticRegression(class_weight='balanced', penalty='l2', C=1.0,random_state=0))
            ])
    small_pipe_w_params.fit(X[train, :], y[train])
    probas = small_pipe_w_params.predict_proba(X[test, :])
    # cache probas here to build an ROC curve with a confidence interval later
    scores.append(roc_auc_score(y[test], probas[:,1]))
    weights.append(len(test))

print(scores)
print('mean: {:<1.6f}, std: {:<1.3f}'.format(np.average(scores, axis=0, weights=weights), np.std(scores)))

[0.42727272727272736, 0.47, 0.54, 0.78, 0.7]
mean: 0.580727, std: 0.135

Note: mean_test_score is not just a simple average; it is a weighted mean. The reason is the iid parameter.

From the documentation:

iid : boolean, default='warn'
    If True, return the average score across folds, weighted by the number of samples in each test set. In this case, the data is assumed to be identically distributed across the folds, and the loss minimized is the total loss per sample, and not the mean loss across the folds. If False, return the average score across folds. Default is True, but will change to False in version 0.21, to correspond to the standard definition of cross-validation.

Changed in version 0.20: Parameter iid will change from True to False by default in version 0.22, and will be removed in 0.24.
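
To make the difference concrete, here is a small sketch with hypothetical per-fold scores and fold sizes (the numbers are made up; only the weighting logic matters):

import numpy as np

fold_scores = np.array([0.60, 0.55, 0.70])  # hypothetical per-fold ROC AUC scores
fold_sizes = np.array([30, 30, 40])         # hypothetical test-fold sizes

simple_mean = fold_scores.mean()                             # iid=False behaviour
weighted_mean = np.average(fold_scores, weights=fold_sizes)  # iid=True behaviour
print(simple_mean, weighted_mean)  # 0.6166666666666667 0.625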

Upvotes: 2
