Reputation: 177
I was performing cross-validation using xgboost.cv but then wanted to switch to cross_val_score so I could use it with GridSearchCV. Before moving on to hyperparameter tuning, I checked whether the results from xgboost.cv and cross_val_score are similar, and found that there are huge differences.
I use xgboost.cv as:
params = {"objective":"binary:logistic",'colsample_bytree': 1,'learning_rate': 0.3, 'max_depth': 6, 'alpha': 0}
dmatrix = xgboost.DMatrix(table_X,table_y)
xgb_cv = xgboost.cv(dtrain=dmatrix, params=params, nfold=5,
num_boost_round=100, early_stopping_rounds=10, metrics="aucpr", as_pandas=True)
and the last row for the xgb_cv is:
train-aucpr-mean | train-aucpr-std | test-aucpr-mean | test-aucpr-std |
---|---|---|---|
0.81 | 0.00 | 0.77 | 0.00 |
For cross_val_score I use:
xgb = xgboost.XGBClassifier(n_estimators=100, **params)
skf = StratifiedKFold(n_splits=5)
cross_val_scores = cross_val_score(xgb,table_X,table_y, scoring='average_precision', cv=skf)
And it ends up with a mean of 0.64. That is a worrisome difference. What am I doing wrong?
Secondly, the standard deviation of 0 for the xgboost.cv results looks quite strange.
Upvotes: 3
Views: 2353
Reputation: 46888
In the xgboost.cv call, "aucpr" is used (thanks to @BenReiniger for pointing this out). According to the documentation, this is the area under the precision-recall curve computed with the linear trapezoidal method, whereas average_precision from sklearn uses a different method.
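To see the two conventions concretely, here is a minimal sketch using toy labels and scores (not taken from the question's data); sklearn exposes both computations:
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve, auc

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])

# step-wise summation over recall increments, what scoring='average_precision' uses
ap = average_precision_score(y_true, y_score)

# linear trapezoidal area under the precision-recall curve
precision, recall, _ = precision_recall_curve(y_true, y_score)
trapezoid = auc(recall, precision)

print(ap, trapezoid)  # the two values generally differ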
So if we stick to the method used by sklearn (the equivalent in xgboost is "map"), it gives a very similar score.
Example dataset:
from sklearn import datasets
import xgboost
from sklearn.model_selection import StratifiedKFold, cross_val_score

# binary target: class 1 vs. the rest
iris = datasets.load_iris()
X = iris.data
y = (iris.target == 1).astype(int)

dmatrix = xgboost.DMatrix(X, y)
We use the same params, and the same k-fold splitter for both functions:
params = {"objective":"binary:logistic",'colsample_bytree': 1,'learning_rate': 0.3, 'max_depth': 6, 'alpha': 0}
skf = StratifiedKFold(n_splits=5)
Early stopping cannot be used here, because the sklearn cross_val_score function does not support it, so we have to boost for the same fixed number of rounds (a workaround is sketched after the results below):
xgb_cv = xgboost.cv(dtrain=dmatrix, params=params, folds=skf, metrics="map", as_pandas=True, num_boost_round=100)
xgb = xgboost.XGBClassifier(n_estimators=100, **params)
cross_val_scores = cross_val_score(xgb, X, y, scoring='average_precision', cv=skf)
print(cross_val_scores)
[1. 1. 0.8915404 0.91916667 1. ]
The above gives a mean of 0.9621414141414141.
And the last rows of the xgboost.cv output, with the test-map-mean column very close to the value above:
train-map-mean train-map-std test-map-mean test-map-std
95 0.999878 0.000244 0.962562 0.046144
96 0.999878 0.000244 0.962562 0.046144
97 0.999878 0.000244 0.962562 0.046144
98 0.999878 0.000244 0.962562 0.046144
99 0.999878 0.000244 0.962562 0.046144
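If early stopping is still wanted, one hypothetical workaround (not part of the original comparison) is to let xgboost.cv pick the number of rounds first and then fix n_estimators to that value for cross_val_score; this relies on xgboost.cv trimming its returned history to the best iteration when early stopping triggers:
# hypothetical workaround: derive n_estimators from an early-stopped xgboost.cv run
tuned_cv = xgboost.cv(dtrain=dmatrix, params=params, folds=skf, metrics="map",
                      num_boost_round=100, early_stopping_rounds=10, as_pandas=True)
best_rounds = len(tuned_cv)  # rows are trimmed to the best iteration
xgb_best = xgboost.XGBClassifier(n_estimators=best_rounds, **params)
cross_val_scores = cross_val_score(xgb_best, X, y, scoring='average_precision', cv=skf)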
To use the trapezoidal method (i.e. interpolation) instead, switch the xgboost metric back to "aucpr"; on the sklearn side, the built-in scorer that uses trapezoidal integration is 'roc_auc', although it integrates the ROC curve rather than the precision-recall curve:
xgb_cv = xgboost.cv(dtrain=dmatrix, params=params, folds=skf, metrics="aucpr", as_pandas=True, num_boost_round=100)
cross_val_scores = cross_val_score(xgb, X, y, scoring='roc_auc', cv=skf)
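If an exact sklearn counterpart of "aucpr" is needed, one possible sketch (the helper names here are made up, not from the original answer) is a custom scorer that applies the trapezoidal rule to the precision-recall curve:
from sklearn.metrics import auc, precision_recall_curve, make_scorer

def pr_auc(y_true, y_score):
    # trapezoidal area under the precision-recall curve
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return auc(recall, precision)

pr_auc_scorer = make_scorer(pr_auc, needs_threshold=True)
cross_val_scores = cross_val_score(xgb, X, y, scoring=pr_auc_scorer, cv=skf)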
Upvotes: 3