overb

Reputation: 177

XGBoost - huge difference between xgb.cv and cross_val_score

I was performing cross-validation with xgboost.cv but then wanted to switch to cross_val_score so that I can use it with GridSearchCV. Before moving on to hyperparameter tuning I checked whether the results from xgboost.cv and cross_val_score are similar, and found that there are huge differences.

I use xgboost.cv as:

params = {"objective":"binary:logistic",'colsample_bytree': 1,'learning_rate': 0.3, 'max_depth': 6, 'alpha': 0}

dmatrix = xgboost.DMatrix(table_X,table_y)

xgb_cv = xgboost.cv(dtrain=dmatrix, params=params, nfold=5,
                    num_boost_round=100, early_stopping_rounds=10, metrics="aucpr", as_pandas=True)

and the last row for the xgb_cv is:

   train-aucpr-mean  train-aucpr-std  test-aucpr-mean  test-aucpr-std
               0.81             0.00             0.77            0.00

For cross_val score I use

xgb = xgboost.XGBClassifier(n_estimators=100, **params)

skf = StratifiedKFold(n_splits=5)
cross_val_scores = cross_val_score(xgb,table_X,table_y, scoring='average_precision', cv=skf)

And it ends up with a mean of 0.64. That is a worrisome difference. What am I doing wrong?

Secondly, the standard deviation of 0 for the xgboost.cv results looks quite strange.

Upvotes: 3

Views: 2353

Answers (1)

StupidWolf

Reputation: 46888

In the xgboost.cv call, "aucpr" is used (thanks to @BenReiniger for pointing this out). According to the documentation, this is the area under the precision-recall curve computed with the linear trapezoidal method, whereas average_precision from sklearn uses a different, non-interpolated summation.

So if we stick to the method used by sklearn (the equivalent metric in xgboost is "map", mean average precision), the two give very similar scores.
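
To see the difference between the two summaries, here is a small sketch (the labels and scores below are made up, just for illustration) comparing sklearn's non-interpolated average precision with the trapezoidal area under the precision-recall curve that "aucpr" is documented to use:

import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve, auc

# made-up toy labels and scores, only to show that the two summaries differ
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.55, 0.70])

# sklearn's average_precision: step-wise sum over the precision-recall curve
ap = average_precision_score(y_true, y_score)

# trapezoidal area under the precision-recall curve (linear interpolation)
precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_auc_trap = auc(recall, precision)

print(ap, pr_auc_trap)  # the two values generally do not match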

Example dataset:

from sklearn import datasets
from sklearn.model_selection import StratifiedKFold, cross_val_score
import xgboost

iris = datasets.load_iris()
X = iris.data
y = (iris.target == 1).astype(int)  # binary target: class 1 vs the rest
dmatrix = xgboost.DMatrix(X, y)

Set up the params, and use the same k-fold splitter for both functions:

params = {"objective":"binary:logistic",'colsample_bytree': 1,'learning_rate': 0.3, 'max_depth': 6, 'alpha': 0}

skf = StratifiedKFold(n_splits=5)

Early stopping cannot be used here, because cross_val_score does not support it, so both functions have to run the same number of boosting rounds:

xgb_cv = xgboost.cv(dtrain=dmatrix, params=params, folds = skf, metrics = "map", as_pandas=True, num_boost_round = 100)

xgb = xgboost.XGBClassifier(n_estimators=100, **params)

cross_val_scores = cross_val_score(xgb,X,y, scoring='average_precision',cv=skf)

print(cross_val_scores)
[1.         1.         0.8915404  0.91916667 1.        ]

The above gives a mean of 0.9621414141414141.

And the xgboost.cv result, with the test-map-mean column very close to the value above:

    train-map-mean  train-map-std  test-map-mean  test-map-std
95        0.999878       0.000244       0.962562      0.046144
96        0.999878       0.000244       0.962562      0.046144
97        0.999878       0.000244       0.962562      0.046144
98        0.999878       0.000244       0.962562      0.046144
99        0.999878       0.000244       0.962562      0.046144
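
As a side note on the early-stopping caveat above: one possible workaround (just a sketch, not part of the comparison itself) is to let xgboost.cv choose the number of boosting rounds with early stopping first, and then fix n_estimators to that value for cross_val_score:

# run xgboost.cv with early stopping to pick the number of rounds;
# the returned history is truncated at the best iteration
cv_hist = xgboost.cv(dtrain=dmatrix, params=params, folds=skf, metrics="map",
                     num_boost_round=500, early_stopping_rounds=10, as_pandas=True)
best_rounds = len(cv_hist)

# reuse that count so cross_val_score trains models of comparable size
xgb_es = xgboost.XGBClassifier(n_estimators=best_rounds, **params)
print(cross_val_score(xgb_es, X, y, scoring='average_precision', cv=skf).mean())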

To use the trapezoidal method (i.e. interpolation), xgboost's metric is "aucpr"; sklearn's built-in 'roc_auc' scorer also uses trapezoidal integration, although over the ROC curve rather than the precision-recall curve:

xgb_cv = xgboost.cv(dtrain=dmatrix, params=params, folds = skf, metrics = "aucpr", as_pandas=True, num_boost_round = 100)

cross_val_scores = cross_val_score(xgb,X,y, scoring='roc_auc',cv=skf)
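
sklearn does not ship a scoring string for the trapezoidal area under the precision-recall curve, so if an exact counterpart of "aucpr" is wanted, a custom scorer has to be built; a possible sketch (the helper name pr_auc_score is made up here):

from sklearn.metrics import precision_recall_curve, auc, make_scorer

def pr_auc_score(y_true, y_score):
    # trapezoidal area under the precision-recall curve
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return auc(recall, precision)

# needs_proba=True so the classifier's predicted probabilities are scored
# (newer sklearn versions use response_method='predict_proba' instead)
pr_auc_scorer = make_scorer(pr_auc_score, needs_proba=True)
cross_val_scores = cross_val_score(xgb, X, y, scoring=pr_auc_scorer, cv=skf)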

Upvotes: 3
