Alex Ivanov

Reputation: 737

How to evaluate the stability of an xgboost classification model

I have:

  1. Python xgboost classification model
  2. Weekly datasets (the basis of the classification) since the beginning of 2018. Each dataset has about 100 thousand rows and 70 columns (features).
  3. Weekly prediction results for these datasets from the xgboost model (logistic objective), in the following format (see the loading sketch below):
- date of modelling
- items
- test_auc_mean for each item (in percentage).

In total there are about 100 datasets and 100 prediction_results since January 2018.
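
For reference, these weekly result files can be stacked into one long table, roughly as in the sketch below (the predictions/ path and the column names date, item, test_auc_mean are placeholders for my actual files):

import glob
import pandas as pd

# Hypothetical layout: one CSV per week under predictions/, with the three
# columns described above: date (of modelling), item, test_auc_mean.
files = sorted(glob.glob('predictions/*.csv'))

# Stack the ~100 weekly files into one long table: one row per (date, item).
results = pd.concat(
    (pd.read_csv(f, parse_dates=['date']) for f in files),
    ignore_index=True,
)
print(results.shape)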

To assess the model I use metrics such as:

- AUC

- confusion matrix

- accuracy

import xgboost as xgb
from sklearn.metrics import confusion_matrix

# hyperparameter values (num_parallel_tree, subsample, colsample_bytree, ...)
# and the dtrain/dtest DMatrix objects are defined earlier in my script
param = {
    'num_parallel_tree':num_parallel_tree,
    'subsample':subsample,
    'colsample_bytree':colsample_bytree,
    'objective':objective, 
    'learning_rate':learning_rate, 
    'eval_metric':eval_metric, 
    'max_depth':max_depth,
    'scale_pos_weight':scale_pos_weight,
    'min_child_weight':min_child_weight,
    'nthread':nthread,
    'seed':seed
}

bst_cv = xgb.cv(
    param,
    dtrain,
    num_boost_round=n_estimators,
    nfold=nfold,
    early_stopping_rounds=early_stopping_rounds,
    verbose_eval=verbose,
    stratified=stratified
)

test_auc_mean = bst_cv['test-auc-mean']
best_iteration = test_auc_mean.idxmax()

bst = xgb.train(param,
                dtrain,
                # +1 because the cv results are 0-indexed by boosting round
                num_boost_round=best_iteration + 1)

best_train_auc_mean = bst_cv['train-auc-mean'][best_iteration]
best_train_auc_mean_std = bst_cv['train-auc-std'][best_iteration]

best_test_auc_mean = bst_cv['test-auc-mean'][best_iteration]
best_test_auc_mean_std = bst_cv['test-auc-std'][best_iteration]

print('''XGB CV model report
Best train-auc-mean {}% (std: {}%) 
Best test-auc-mean {}% (std: {}%)'''.format(round(best_train_auc_mean * 100, 2), 
                                          round(best_train_auc_mean_std * 100, 2), 
                                          round(best_test_auc_mean * 100, 2), 
                                          round(best_test_auc_mean_std * 100, 2)))

# with a logistic objective, predict() returns probabilities; 0.9 is my classification threshold
y_pred = bst.predict(dtest)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred > 0.9).ravel()


# table layout: rows = correctly vs. incorrectly classified, columns = predicted class
print('''
     | neg | pos |
__________________
true_| {}  | {}  |
false| {}  | {}  |
__________________

'''.format(tn, tp, fn, fp))

predict_accuracy_on_test_set = (tn + tp)/(tn + fp + fn + tp)
print('Test Accuracy: {}%'.format(round(predict_accuracy_on_test_set * 100, 2)))

The model gives me a general picture (usually the AUC is between .94 and .96). The problem is that the predictions for some specific items are very unstable: today an item is classified as positive, tomorrow as negative, and the day after tomorrow as positive again.

I want to evaluate the model's stability. In other words, I want to know how many items with variable results it generates. In the end, I want to be sure that the model will generate stable results with minimal fluctuation. Do you have any thoughts on how to do this?
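
To make this concrete, the kind of per-item flip rate I have in mind could look roughly like the sketch below, building on the stacked results table above and a hypothetical predicted_label column holding the 0/1 call the model made for each item in a given week:

# results: the stacked weekly predictions from above, with a hypothetical
# 'predicted_label' column holding the 0/1 call made for the item that week.
results = results.sort_values(['item', 'date'])

# Per item: fraction of week-to-week transitions where the predicted label flips.
flip_rate = (
    results.groupby('item')['predicted_label']
           .apply(lambda s: (s != s.shift()).iloc[1:].mean())
)

# Items flipping in more than 20% of transitions (the 20% cut-off is
# arbitrary, only meant to illustrate an instability measure).
unstable = flip_rate[flip_rate > 0.2]
print('unstable items: {} of {} ({:.1%})'.format(
    len(unstable), len(flip_rate), len(unstable) / len(flip_rate)))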

Upvotes: 2

Views: 3754

Answers (2)

Gwendal Yviquel

Reputation: 392

That's precisely the goal of cross validation. Since you already did it, you can only evaluate the standard deviation of your evaluation metrics, which you already did as well...

  1. You can try some new metrics, like precision, recall, F1 score or Fβ score, to weight success and failure differently, but it looks like you're almost out of solutions. You're dependent on your data input here :s

  2. You could spend some time on the training population distribution, and try to identify which part of the population fluctuates over time.

  3. You could also try to predict probabilities rather than hard classifications, to evaluate whether the model is far above its threshold or not (points 1 and 3 are sketched below).

The last two solutions are more like side solutions. :(
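
For points 1 and 3, a minimal sketch, assuming the bst booster, dtest and y_test from your question and the 0.9 threshold you already use (with a binary logistic objective, bst.predict returns probabilities):

from sklearn.metrics import precision_score, recall_score, f1_score

# Point 1: precision / recall / F1 on the thresholded predictions.
y_proba = bst.predict(dtest)           # probabilities under a logistic objective
y_hard = (y_proba > 0.9).astype(int)   # same 0.9 threshold as in the question

print('precision: {:.3f}'.format(precision_score(y_test, y_hard)))
print('recall:    {:.3f}'.format(recall_score(y_test, y_hard)))
print('f1:        {:.3f}'.format(f1_score(y_test, y_hard)))

# Point 3: distance of each probability from the threshold; items with a
# small margin are the ones most likely to flip from one week to the next.
margin = abs(y_proba - 0.9)
print('items within 0.05 of the threshold: {:.1%}'.format((margin < 0.05).mean()))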

Upvotes: 4

Alex Ivanov

Reputation: 737

[Attached figure: predict_proba of one item (mean AUC).]

Gwendal, thank you. Could you elaborate on the two approaches you mentioned? 1) How can I analyse the training population distribution: via k-means clustering or other unsupervised learning methods? 2) E.g. I computed predict_proba (the diagram for one specific item is in the attachment). How can I evaluate whether the model is far above its threshold? By comparing the predicted probability of each item with its true label (e.g. predict_proba = 0.5 and label = 1)?
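
A toy sketch of the comparison I mean, with made-up numbers purely for illustration (weekly predicted probabilities for one item, its true label, and a 0.5 threshold):

import numpy as np

# Made-up weekly predicted probabilities for a single item, its true label,
# and the decision threshold: purely illustrative values.
weekly_proba = np.array([0.52, 0.48, 0.61, 0.55, 0.47])
true_label = 1
threshold = 0.5

# How far is the model from the threshold each week, and how often does it
# end up on the wrong side of the threshold for this item?
margins = weekly_proba - threshold
wrong_side = (weekly_proba >= threshold) != bool(true_label)

print('mean |margin| from threshold: {:.2f}'.format(np.abs(margins).mean()))
print('weeks on the wrong side: {}'.format(int(wrong_side.sum())))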

Upvotes: 1
