Reputation: 1222
GridSearchCV only returns a score for each parametrization, and I would like to see an ROC curve as well to better understand the results. In order to do this, I would like to take the best-performing model from GridSearchCV and reproduce the same results, but cache the probabilities. Here is my code:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from tqdm import tqdm
import warnings
warnings.simplefilter("ignore")
data = make_classification(n_samples=100, n_features=20, n_classes=2,
                           random_state=1, class_sep=0.1)
X, y = data
small_pipe = Pipeline([
    ('rfs', SelectFromModel(RandomForestClassifier(n_estimators=100))),
    ('clf', LogisticRegression())
])
params = {
    'clf__class_weight': ['balanced'],
    'clf__penalty': ['l1', 'l2'],
    'clf__C': [0.1, 0.5, 1.0],
    'rfs__max_features': [3, 5, 10]
}
key_feats = ['mean_train_score', 'mean_test_score', 'param_clf__C',
             'param_clf__penalty', 'param_rfs__max_features']
skf = StratifiedKFold(n_splits=5, random_state=0)
all_results = list()
for _ in tqdm(range(25)):
    gs = GridSearchCV(small_pipe, param_grid=params, scoring='roc_auc', cv=skf, n_jobs=-1)
    gs.fit(X, y)
    results = pd.DataFrame(gs.cv_results_)[key_feats]
    all_results.append(results)

param_group = ['param_clf__C', 'param_clf__penalty', 'param_rfs__max_features']
all_results_df = pd.concat(all_results)
all_results_df.groupby(param_group).agg(['mean', 'std']
    ).sort_values(('mean_test_score', 'mean'), ascending=False).head(20)
Here is my attempt at reproducing the results:
small_pipe_w_params = Pipeline([
    ('rfs', SelectFromModel(RandomForestClassifier(n_estimators=100), max_features=3)),
    ('clf', LogisticRegression(class_weight='balanced', penalty='l2', C=0.1))
])
skf = StratifiedKFold(n_splits=5, random_state=0)
all_scores = list()
for _ in range(25):
scores = list()
for train, test in skf.split(X, y):
small_pipe_w_params.fit(X[train, :], y[train])
probas = small_pipe_w_params.predict_proba(X[test, :])[:, 1]
# cache probas here to build an Roc w/ conf interval later
scores.append(roc_auc_score(y[test], probas))
all_scores.extend(scores)
print('mean: {:<1.3f}, std: {:<1.3f}'.format(np.mean(all_scores), np.std(all_scores)))
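As an aside, here is a minimal sketch of the "cache probas" idea in the comment above: collect the per-fold probabilities, interpolate each fold's ROC curve onto a common FPR grid, and plot the mean curve with a ±1 std band as a simple stand-in for a confidence interval. This is my own illustration; it assumes matplotlib is available and reuses skf and small_pipe_w_params from above.
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

mean_fpr = np.linspace(0, 1, 101)
tprs, aucs = [], []
for train, test in skf.split(X, y):
    small_pipe_w_params.fit(X[train, :], y[train])
    probas = small_pipe_w_params.predict_proba(X[test, :])[:, 1]
    fpr, tpr, _ = roc_curve(y[test], probas)
    tprs.append(np.interp(mean_fpr, fpr, tpr))  # put each fold's ROC on a common grid
    tprs[-1][0] = 0.0
    aucs.append(roc_auc_score(y[test], probas))

mean_tpr = np.mean(tprs, axis=0)
std_tpr = np.std(tprs, axis=0)
plt.plot(mean_fpr, mean_tpr, label='mean ROC (AUC = {:.2f})'.format(np.mean(aucs)))
plt.fill_between(mean_fpr, np.clip(mean_tpr - std_tpr, 0, 1),
                 np.clip(mean_tpr + std_tpr, 0, 1), alpha=0.3)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()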
I'm running both of the above multiple times, as the results seem unstable. I created a challenging dataset because my own dataset is similarly hard to learn. The groupby is meant to take all iterations of GridSearchCV and average the train and test scores (and take their std) to stabilize the results. I then pick out the best-performing model (C=0.1, penalty=l2 and max_features=3 in my most recent run) and try to reproduce these same results when I put those params in deliberately.
The GridSearchCV model yields a 0.63 mean and 0.042 std ROC AUC, whereas my own implementation gets a 0.59 mean and 0.131 std. The grid search scores are considerably better. If I run this experiment out to 100 iterations for both GSCV and my own loop, the results are similar.
Why are these results not the same? They both internally use StratifiedKFold() when an integer for cv is supplied... and maybe GridSearchCV weights the scores by the size of each fold? I'm not sure about that, though it would make sense. Is my implementation flawed?
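For reference, here is a quick sketch of how that suspicion could be checked against the cv_results_ of the gs fitted above: compare a plain average of the per-split scores with the reported mean_test_score. The column selection is my own assumption about the cv_results_ layout.
# Sketch: compare an unweighted average of the per-split scores with mean_test_score.
# A systematic difference would mean GridSearchCV is weighting folds by test-set size.
res = pd.DataFrame(gs.cv_results_)
split_cols = [c for c in res.columns if c.startswith('split') and c.endswith('_test_score')]
print(res[split_cols].mean(axis=1).head())  # plain mean of the 5 split scores
print(res['mean_test_score'].head())        # what GridSearchCV reports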
Edit: random_state added to StratifiedKFold.
Upvotes: 3
Views: 2279
Reputation: 16966
If you set the random_state of the RandomForestClassifier, the variation between different GridSearchCV runs would be eliminated.
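For example, a minimal sketch of that change, reusing small_pipe's structure, params, skf, X and y from the question (deterministic_pipe is just an illustrative name):
# Fix the forest's random_state so feature selection is deterministic; repeated
# GridSearchCV runs then produce identical cv_results_.
deterministic_pipe = Pipeline([
    ('rfs', SelectFromModel(RandomForestClassifier(n_estimators=10, random_state=0))),
    ('clf', LogisticRegression(random_state=0))
])
gs = GridSearchCV(deterministic_pipe, param_grid=params, scoring='roc_auc', cv=skf, n_jobs=-1)
gs.fit(X, y)  # fitting again now reproduces exactly the same scores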
For simplicity, I have set n_estimators=10 and got the following result:
                                                         mean_train_score      mean_test_score
                                                             mean       std         mean   std
param_clf__C param_clf__penalty param_rfs__max_features
1.0          l2                 5                        0.766701  0.000000     0.580727   0.0
                                10                       0.768849  0.000000     0.577737   0.0
Now, if we look at the performance of the best hyperparameters on each split (by removing the key_feats filtering), using
all_results_df.sort_values(('mean_test_score'), ascending=False).head(1).T
we will get
16
mean_fit_time 0.228381
mean_score_time 0.113187
mean_test_score 0.580727
mean_train_score 0.766701
param_clf__C 1
param_clf__class_weight balanced
param_clf__penalty l2
param_rfs__max_features 5
params {'clf__class_weight': 'balanced', 'clf__penalt...
rank_test_score 1
split0_test_score 0.427273
split0_train_score 0.807051
split1_test_score 0.47
split1_train_score 0.791745
split2_test_score 0.54
split2_train_score 0.789243
split3_test_score 0.78
split3_train_score 0.769856
split4_test_score 0.7
split4_train_score 0.67561
std_fit_time 0.00586908
std_score_time 0.00152781
std_test_score 0.13555
std_train_score 0.0470554
Let us reproduce this!
skf = StratifiedKFold(n_splits=5, random_state=0)
all_scores = list()
scores = []
weights = []
for train, test in skf.split(X, y):
    small_pipe_w_params = Pipeline([
        ('rfs', SelectFromModel(RandomForestClassifier(n_estimators=10,
                                                       random_state=0), max_features=5)),
        ('clf', LogisticRegression(class_weight='balanced', penalty='l2', C=1.0, random_state=0))
    ])
    small_pipe_w_params.fit(X[train, :], y[train])
    probas = small_pipe_w_params.predict_proba(X[test, :])
    # cache probas here to build an ROC w/ conf interval later
    scores.append(roc_auc_score(y[test], probas[:, 1]))
    weights.append(len(test))
print(scores)
print('mean: {:<1.6f}, std: {:<1.3f}'.format(np.average(scores, axis=0, weights=weights), np.std(scores)))
[0.42727272727272736, 0.47, 0.54, 0.78, 0.7]
mean: 0.580727, std: 0.135
Note: mean_test_score is not just a simple average; it's a weighted mean. The reason is the iid parameter.
From the documentation:
iid : boolean, default=’warn’ If True, return the average score across folds, weighted by the number of samples in each test set. In this case, the data is assumed to be identically distributed across the folds, and the loss minimized is the total loss per sample, and not the mean loss across the folds. If False, return the average score across folds. Default is True, but will change to False in version 0.21, to correspond to the standard definition of cross-validation.
Changed in version 0.20: Parameter iid will change from True to False by default in version 0.22, and will be removed in 0.24.
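To make the weighting concrete, here is a small check using the split scores printed above; the test-fold sizes of [21, 20, 20, 20, 19] are my assumption, chosen because they are consistent with the reported mean_test_score of 0.580727:
split_scores = [0.42727272727272736, 0.47, 0.54, 0.78, 0.7]
fold_sizes = [21, 20, 20, 20, 19]  # assumed test-fold sizes (they sum to the 100 samples)

print(np.mean(split_scores))                         # ~0.583455 -> unweighted mean (iid=False)
print(np.average(split_scores, weights=fold_sizes))  # ~0.580727 -> weighted mean (iid=True)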
Upvotes: 2