N. Kiefer

Reputation: 337

Low probabilities when using XGBoost on a multiclass problem

I am using an XGBClassifier to do text classification with more than two classes. The model reaches about 65% accuracy, so I looked into the probabilities it is outputting. For no test example that I show to the model does it output more than 0.3 for any given class. Even when the model is correct, it is therefore choosing a class on a difference of about 20%.

Is that something I should be worried about? I would expect the model to be sure (i.e. outputting around 90%) at least in some cases. Is there even such an easy interpretation of the output probabilities? Or should I not be worried about the output probabilities as long as the predicted class is correct?

Edit: I have around 100 classes, which are also imbalanced: roughly 3 categories take up 70% of the whole data. The class sizes decrease more or less linearly.

The data are German texts, if anybody is interested.

Upvotes: 1

Views: 349

Answers (1)

cousin_pete

Reputation: 578

Welcome to SO! In the absence of any data sample or code, it is hard to comment on what the issues are.

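What are the class distributions in your data? A predicted probability only means something relative to the base rate of its class. Say, for example, you had five classes equally distributed, i.e. about 20% each. Then an output of 0.20 for a class is just the base rate and tells you nothing. If instead you had 100 classes equally distributed, i.e. about 1% each, an output of 0.20 for a particular class could well be highly significant, i.e. the model is pretty sure about this allocation.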
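To make the comparison against the class distribution concrete, here is a minimal sketch that divides each predicted probability by the base rate of its class. The helper names (`class_priors`, `lift_over_prior`), the four-class label counts, and the probability row are all made up for illustration; in practice `proba` would be the array returned by something like `XGBClassifier.predict_proba`, and the priors would come from your own training labels.

```python
import numpy as np

def class_priors(y, n_classes):
    """Base rate of each class, estimated from integer labels 0..n_classes-1."""
    counts = np.bincount(y, minlength=n_classes)
    return counts / counts.sum()

def lift_over_prior(proba, priors):
    """Ratio of each predicted probability to the base rate of its class.

    A value near 1 means the model says no more than the class frequency
    already does; values well above 1 mean the model favours that class.
    """
    proba = np.asarray(proba, dtype=float)
    priors = np.asarray(priors, dtype=float)
    return proba / priors  # priors broadcast across the rows of proba

# Hypothetical, imbalanced training labels for four classes (illustration only).
y_train = np.array([0] * 50 + [1] * 30 + [2] * 15 + [3] * 5)
priors = class_priors(y_train, n_classes=4)

# Hypothetical predicted probabilities for a single test example.
proba = np.array([[0.30, 0.30, 0.30, 0.10]])

lift = lift_over_prior(proba, priors)
print(lift)
```

In this toy row the same output of 0.30 is below the base rate for the most frequent class but about twice the base rate for the third class, so the raw probability alone does not say how sure the model is.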

Is it possible to post some data and code? If the data is sensitive, then anonymize it.

Upvotes: 1
