gabboshow
gabboshow

Reputation: 5559

GridSearchCV does not give the same results as expected when compared to xgboost.cv

when comparing sklearn.GridSearchCV with xgboost.cv I get different results... below I explain what I would like to do:

  1. import libraries
    import numpy as np
    from sklearn import datasets
    import xgboost as xgb
    from sklearn.model_selection import GridSearchCV
    from xgboost.sklearn import XGBClassifier
    from sklearn.model_selection import StratifiedKFold
  1. set seed and folds
    seed = 5
    n_fold_inner = 5
    skf_inner               = StratifiedKFold(n_splits=n_fold_inner,random_state=seed, shuffle=True)
  1. load dataset
    X, y = datasets.make_hastie_10_2(n_samples=12000, random_state=1)
    X = X.astype(np.float32)
    
    # map labels from {-1, 1} to {0, 1}
    labels, y = np.unique(y, return_inverse=True)
    
    X_train, X_test = X[:2000], X[2000:]
    y_train, y_test = y[:2000], y[2000:]
    dtrain  = xgb.DMatrix(X_train,  label=y_train, missing = np.nan)
  1. define parameters xgboost
    fixed_parameters = {
                   'max_depth':3,
                   'min_child_weight':3,
                   'learning_rate':0.3,
                   'colsample_bytree':0.8,
                   'subsample':0.8,
                   'gamma':0,
                   'max_delta_step':0,
                   'colsample_bylevel':1,
                   'reg_alpha':0,
                   'reg_lambda':1,
                   'scale_pos_weight':1,
                   'base_score':0.5,
                   'seed':5,
                   'objective':'binary:logistic',
                   'silent': 1}
  1. parameters for which I do the grid search (only one, i.e. the number of estimators)
    params_grid = {
                   'n_estimators':np.linspace(1, 20, 20).astype('int')
  1. perform grid search
    bst_grid = GridSearchCV(
                estimator=XGBClassifier(**fixed_parameters),param_grid=params_grid,n_jobs=4,
                cv=skf_inner,scoring='roc_auc',iid=False,refit=False,verbose=1)
    
    bst_grid.fit(X_train,y_train)
    
    best_params_grid_search = bst_grid.best_params_
    best_score_grid_search = bst_grid.best_score_
    
       
    means_train = bst_grid.cv_results_['mean_train_score']
    stds_train = bst_grid.cv_results_['std_train_score']
    means_test = bst_grid.cv_results_['mean_test_score']
    stds_test = bst_grid.cv_results_['std_test_score']
  1. print results
    print('\ntest-auc-mean  test-auc-std  train-auc-mean  train-auc-std')
    for idx in range(0, len(means_test)):
        print means_test[idx], stds_test[idx], means_train[idx], stds_train[idx]
  1. now I run xgb.cv with the same parameters of before for 20 rounds (the n_estimators that I was giving as input to the gridsearch before. The problem is that I get different results...
    num_rounds = 20
    best_params_grid_search['objective']= 'binary:logistic'
    best_params_grid_search['silent']= 1
    cv_xgb = xgb.cv(best_params_grid_search,dtrain,num_boost_round =num_rounds,folds=skf_inner,metrics={'auc'},seed=seed,maximize=True)
    print(cv_xgb)

RESULT GRIDSEARCH (each row is using n estimators (1,2,3,...,20)

test-auc-mean  test-auc-std  train-auc-mean  train-auc-std
0.610051313783 0.0161039540435 0.644057288587 0.0113345992869
0.69201880047 0.0162563563448 0.736006666658 0.00692672815659
0.745466211655 0.0171675737271 0.796345885396 0.00696679302744
0.783959748994 0.00705320521545 0.841463145757 0.00948465661336
0.814666429161 0.0205663250121 0.876016226998 0.00594191823748
0.834757856446 0.0380407635359 0.89839145346 0.0119466187041
0.846589877247 0.0250769570711 0.918506450202 0.00400934458132
0.856519550489 0.02076405634 0.929968936282 0.00287173282935
0.874262106553 0.0270140215944 0.940190511945 0.00335749381638
0.884796282407 0.0242102758081 0.947369708661 0.00274634034559
0.890833683342 0.0240690598159 0.953708404754 0.00332080069217
0.898287157179 0.0212975975614 0.958794323829 0.00463360376002
0.905931348284 0.0240526927266 0.963055575138 0.00385161158711
0.911782932073 0.0169788764956 0.966542306102 0.00274612227499
0.912551138778 0.0175200936415 0.969060984867 0.00135518880398
0.915046588665 0.0169918459539 0.971904231381 0.00177694652262
0.917921423036 0.0131486037603 0.975162276052 0.0025983006922
0.921909172729 0.0113192686772 0.976056924526 0.0022670828819
0.928131653291 0.0117709832599 0.978585868159 0.00211167800105
0.931493562339 0.0119475329984 0.98098486872 0.00186032225868

RESULT XGB.CV

    test-auc-mean  test-auc-std  train-auc-mean  train-auc-std
0        0.669881      0.013938        0.772116       0.011315
1        0.759682      0.019225        0.883394       0.004381
2        0.798337      0.016992        0.939274       0.005196
3        0.827751      0.007224        0.962461       0.007382
4        0.850340      0.011451        0.978809       0.001102
5        0.864438      0.020012        0.986584       0.000858
6        0.879706      0.014168        0.991765       0.001926
7        0.889308      0.013851        0.994663       0.000970
8        0.897973      0.011383        0.996704       0.000481
9        0.903878      0.012139        0.997494       0.000432
10       0.909599      0.010234        0.998301       0.000602
11       0.912682      0.014475        0.998972       0.000306
12       0.914289      0.014122        0.999392       0.000207
13       0.916273      0.011744        0.999568       0.000185
14       0.918050      0.011219        0.999718       0.000140
15       0.922161      0.011968        0.999788       0.000146
16       0.922990      0.010124        0.999863       0.000085
17       0.924221      0.009026        0.999893       0.000082
18       0.925718      0.008859        0.999929       0.000060
19       0.926104      0.007586        0.999959       0.000030

Upvotes: 6

Views: 1686

Answers (1)

Amine Benatmane
Amine Benatmane

Reputation: 1261

num_boost_round is the number of boosting iterations (i.e. n_estimators). XGBoost.cv will ignore n_estimators from params and overwrit it by num_boost_round.

Try this:

cv_xgb = xgb.cv(best_params_grid_search,dtrain,num_boost_round =best_params_grid_search['n_estimators'],folds=skf_inner,metrics={'auc'},seed=seed,maximize=True)

Upvotes: 1

Related Questions