WatsOne

Reputation: 25

Why does a decision tree work perfectly on imbalanced data?

I experimented with sampling strategies on the credit card fraud dataset from Kaggle.

The sample consists of 284,807 transactions, of which 497 belong to one class and the rest to the other, a ratio of about 0.172%. The data are imbalanced, and I wanted to test how simple random undersampling works. I split the sample into 20 parts and evaluated each model by the area under the precision-recall curve.

I took linear regression and a decision tree. Linear regression works as expected:

[plot: per-fold precision-recall AUC for linear regression]

But the decision tree seems to work almost perfectly:

[plot: per-fold precision-recall AUC for the decision tree]

Both precision and recall are very high, and undersampling makes them worse. Why is there such a big difference between the two models?
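For reference, a minimal sketch of this kind of experiment, assuming the Kaggle creditcard.csv file with a 0/1 `Class` column; the stratified 20-fold split and the 1:1 undersampling ratio are guesses at the setup, and `LogisticRegression` stands in for the linear model:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("creditcard.csv")  # hypothetical local path
X, y = df.drop(columns="Class").values, df["Class"].values

def undersample(X, y, rng):
    """Randomly drop majority samples until the classes are balanced."""
    minority = np.flatnonzero(y == 1)
    majority = rng.choice(np.flatnonzero(y == 0), size=len(minority), replace=False)
    idx = np.concatenate([minority, majority])
    return X[idx], y[idx]

rng = np.random.default_rng(0)
skf = StratifiedKFold(n_splits=20, shuffle=True, random_state=0)
for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier()):
    scores = []
    for train, test in skf.split(X, y):
        X_tr, y_tr = undersample(X[train], y[train], rng)
        model.fit(X_tr, y_tr)
        # Area under the precision-recall curve on the untouched test fold.
        scores.append(average_precision_score(y[test], model.predict_proba(X[test])[:, 1]))
    print(type(model).__name__, np.mean(scores))
```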

Upvotes: 1

Views: 6739

Answers (1)

AndyShan

Reputation: 181

First, generally speaking, a simple decision tree model cannot handle an imbalanced problem very well. The performance of the model is strongly tied to the actual distribution of the data set.

There are, however, several situations in which a decision tree can do well on an imbalanced problem. Check whether the situation you mentioned in the question matches one of the following:

  1. The minority-class samples all lie in one region of the feature space. Training a decision tree is a recursive process: the algorithm keeps choosing the best splitting attribute and generating branches and nodes until one of three stopping conditions holds: (a) all samples in the current node belong to the same class, so no further split is needed; (b) the attribute set is empty, or all samples take the same value on every remaining attribute, so no split is possible; (c) the sample set at the current node is empty. So if the minority samples are all concentrated in one region, they will be partitioned into their own node, and if the test set follows the same feature distribution, you will get a good classifier (see the sketch after this list).
  2. You are using a decision tree with cost-sensitive learning. If your tree is cost-sensitive, misclassifying a minority-class sample carries a higher cost than misclassifying a majority-class sample (the sketch below also shows how to enable this).
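A toy sketch of both situations, assuming scikit-learn; the data set here is synthetic, invented purely for illustration:

```python
# Point 1: the minority class occupies its own small region of the
# feature space, so a plain decision tree isolates it in a few leaves
# and classifies it near-perfectly.
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_majority = rng.uniform(0, 10, size=(10_000, 2))  # majority fills [0, 10]^2
X_minority = rng.uniform(11, 12, size=(50, 2))     # minority sits apart in [11, 12]^2
X = np.vstack([X_majority, X_minority])
y = np.concatenate([np.zeros(10_000), np.ones(50)])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, tree.predict(X_te), digits=3))  # ~1.0 on both classes

# Point 2: cost-sensitive learning. In scikit-learn, class_weight="balanced"
# reweights samples so that minority-class errors cost proportionally more.
cost_sensitive_tree = DecisionTreeClassifier(class_weight="balanced", random_state=0)
```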

If you use ensemble learning, the model will perform well, but then it is no longer a single decision tree: it is a random forest (RF) or gradient-boosted trees (GBDT).
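Both are drop-in replacements for the single tree in scikit-learn; a minimal sketch (the parameter values are illustrative, not tuned):

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Same fit / predict_proba interface as DecisionTreeClassifier.
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
gbdt = GradientBoostingClassifier(n_estimators=200, random_state=0)
```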

For simple classifiers with a linear decision boundary, such as logistic regression, performance on an imbalanced problem is almost certainly bad. This is because, during training, the model searches for a hyperplane that minimizes the overall loss, and since every sample counts equally, the cheapest solution is to assign almost all samples to the majority label.
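A quick synthetic demonstration of that collapse, assuming scikit-learn; the 200:1 ratio and the class overlap are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Overlapping classes at a 200:1 ratio.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(10_000, 2)),  # majority
               rng.normal(1.0, 1.0, size=(50, 2))])     # minority, overlapping
y = np.concatenate([np.zeros(10_000), np.ones(50)])

clf = LogisticRegression().fit(X, y)
print((clf.predict(X) == 1).sum())  # close to 0 positive predictions
```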

Upvotes: 3
