Ignacio Valenzuela

Reputation: 137

Why am I getting very little variance in predict_proba values in XGBoost?

I'm trying to understand why the values returned by predict_proba in the Python xgboost library all fall within a narrow range, even though the model's AUC on the test set is good enough for the problem at hand (0.78).

The variance is low and the predicted probabilities cluster around the 50% mark.

The test set is approximately 15% of the available data (5000 observations).

I'm using the following parameters:

{'colsample_bytree': 0.5, 'gamma': 2, 'learning_rate': 0.01, 'max_depth': 8, 'min_child_weight': 10,
                'n_estimators': 10, 'scale_pos_weight': 7}

Am I missing something here?

Upvotes: 1

Views: 946

Answers (1)

Mortz

Reputation: 4879

Without access to your data, it is impossible to say exactly why you are seeing this.

That said, here are two things to check:

  • The simplest approach is to validate against an "out-of-time" dataset, i.e. data from a different time period than the one used for training.
  • Check the variance and cardinality of your input features. If, for example, you have 2 independent binary variables, there are only 4 possible combinations of them. No matter how large your training dataset is, predict_proba will return at most 4 distinct values.
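To make the cardinality point concrete, here is a minimal sketch in plain Python. The `toy_model` function is a hypothetical stand-in for a fitted classifier (it is not XGBoost), and the rows are synthetic. The only assumption it relies on is that a fitted model is a deterministic function of its inputs, so the number of distinct feature rows is an upper bound on the number of distinct predicted probabilities.

```python
import math


def count_distinct_rows(rows):
    """Count distinct feature combinations; this bounds the number of distinct predictions."""
    return len({tuple(r) for r in rows})


def toy_model(row):
    # Hypothetical stand-in for a fitted classifier: any deterministic
    # function of the features behaves the same way for this argument.
    a, b = row
    margin = 0.3 * a - 0.2 * b
    return 1.0 / (1.0 + math.exp(-margin))


# 1000 synthetic rows built from two binary features
rows = [(i % 2, (i // 2) % 2) for i in range(1000)]

n_combos = count_distinct_rows(rows)
distinct_probs = {round(toy_model(r), 12) for r in rows}

print("distinct feature rows:", n_combos)
print("distinct predicted probabilities:", len(distinct_probs))
```

On real data, a comparable check (assuming `model` is your fitted classifier and `X_test` your test features) would be to compare the number of unique rows in `X_test` with `numpy.unique(model.predict_proba(X_test)[:, 1]).size`.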

Upvotes: 1
