Reputation: 6615
How do I run a grid search with sklearn xgboost and get back various metrics, ideally at the F1 threshold value?
See my code below... I can't work out what I'm doing wrong and don't understand the error.
######################### just making up a dataset here##############
from sklearn import datasets
from sklearn.metrics import precision_score, recall_score, accuracy_score, roc_auc_score, make_scorer
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.model_selection import train_test_split
from sklearn.grid_search import RandomizedSearchCV
import xgboost as xgb
X, y = datasets.make_classification(n_samples=100000, n_features=20,
                                    n_informative=2, n_redundant=10,
                                    random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.99,
                                                    random_state=42)
The rest is a bunch of parameters and then a randomized grid search. If I change scoring_evals to 'roc_auc' it works, but if I try what seems to be the documented approach for multiple metrics I get an error. Where am I going wrong?
Additionally, how do I ensure that these metrics are reported at the F1 threshold?
params = {
    'min_child_weight': [0.5, 1.0, 3.0, 5.0, 7.0, 10.0],
    'gamma': [0, 0.25, 0.5, 1.0],
    'reg_lambda': [0.1, 1.0, 5.0, 10.0, 50.0, 100.0],
    'max_depth': [2, 4, 6, 10],
    'learning_rate': [0.05, 0.1, 0.2, 0.3, 0.4],
    'colsample_bytree': [1, 0.8, 0.5],
    'subsample': [0.8],
    'n_estimators': [50]
}
folds = 5
max_models = 5
scoring_evals = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score), 'Precision': make_scorer(precision_score),'Recall': make_scorer(recall_score)}
xgb_algo = xgb.XGBClassifier()
random_search = RandomizedSearchCV(xgb_algo,
                                   param_distributions=params, n_iter=max_models,
                                   scoring=scoring_evals, n_jobs=4, cv=5,
                                   verbose=False, random_state=2018)
random_search.fit(X_train, y_train)
The error I get is:
ValueError: scoring value should either be a callable, string or None. {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score), 'Precision': make_scorer(precision_score), 'Recall': make_scorer(recall_score)} was passed
Upvotes: 1
Views: 3704
Reputation: 36599
First, check the version of scikit-learn you are using. If it's v0.19, then you are importing from a deprecated module.
You are doing this:
from sklearn.grid_search import RandomizedSearchCV
And you must have gotten a warning like:
DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. ... ... ...
The classes in the grid_search module are old and deprecated and don't contain the multi-metric functionality you are trying to use.
Pay attention to that warning and do this:
from sklearn.model_selection import RandomizedSearchCV
...
...
...
random_search = RandomizedSearchCV(xgb_algo,
                                   param_distributions=params,
                                   n_iter=max_models,
                                   scoring=scoring_evals, n_jobs=4, cv=5,
                                   verbose=False, random_state=2018, refit=False)
Now look closely at the refit param. In a multi-metric setting you need to set it explicitly, because the best hyper-parameters (and hence the refitted final model) can only be decided on the basis of a single metric.
You can either set it to False, if you don't need a final fitted model and only want the cross-validated performance for the different parameter combinations, or set it to any one of the keys in your scoring dict.
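For example, here is a minimal sketch (assuming scikit-learn >= 0.19, with params, scoring_evals, xgb_algo and the train data defined as in the question) of refitting on one of the metric keys and then reading the per-metric cross-validation results:

from sklearn.model_selection import RandomizedSearchCV

# Refit the final model using the metric named 'AUC' in the scoring dict;
# best_params_ and best_estimator_ are then chosen according to that metric.
random_search = RandomizedSearchCV(xgb_algo,
                                   param_distributions=params, n_iter=max_models,
                                   scoring=scoring_evals, n_jobs=4, cv=5,
                                   verbose=False, random_state=2018, refit='AUC')
random_search.fit(X_train, y_train)

# cv_results_ contains one set of columns per metric, e.g. 'mean_test_AUC',
# 'mean_test_Accuracy', 'mean_test_Precision', 'mean_test_Recall'.
print(random_search.best_params_)
print(random_search.cv_results_['mean_test_Recall'])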
Upvotes: 2
Reputation: 3213
As the error suggests, and as the documentation of v0.18.2 states:
scoring : string, callable or None, default=None
one cannot provide multiple metrics in the scoring argument (in that scikit-learn version).
P.S. All the functions you tried to wrap with make_scorer are already available as predefined scorers, so you can refer to them by their string names: see docs
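For instance, a sketch of the same scoring dict built only from the predefined string names (multi-metric scoring itself still requires scikit-learn >= 0.19):

# Same metrics, referenced by their built-in scorer names instead of
# wrapping the metric functions in make_scorer.
scoring_evals = {
    'AUC': 'roc_auc',
    'Accuracy': 'accuracy',
    'Precision': 'precision',
    'Recall': 'recall'
}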
EDIT: removed the comment on the usage of multiple metrics following Vivek's criticism.
Upvotes: -1