Reputation: 25
I experimented with the credit card fraud dataset from Kaggle.
The sample consists of 284,807 transactions, of which 492 belong to one class and the rest to the other, a ratio of 0.172%. The data are heavily imbalanced, and I wanted to test how well simple random undersampling works. I split the sample into 20 parts and evaluated via the area under the precision-recall curve.
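Roughly, my setup looks like this (sketched here with a synthetic stand-in for the Kaggle file; the `undersample` helper is illustrative, not my exact code):

```python
# Sketch of the experiment: 20-fold stratified evaluation of PR-AUC,
# with and without random undersampling of the training folds.
# The synthetic dataset below is a stand-in for the real Kaggle file.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import average_precision_score

X, y = make_classification(n_samples=50_000, n_features=20,
                           weights=[0.998], flip_y=0, random_state=0)

def undersample(X, y, rng):
    # Keep all minority rows, draw an equal number of majority rows.
    pos = np.flatnonzero(y == 1)
    neg = rng.choice(np.flatnonzero(y == 0), size=len(pos), replace=False)
    idx = np.concatenate([pos, neg])
    return X[idx], y[idx]

rng = np.random.default_rng(0)
for balance in (False, True):
    scores = []
    for train, test in StratifiedKFold(n_splits=20).split(X, y):
        X_tr, y_tr = X[train], y[train]
        if balance:
            X_tr, y_tr = undersample(X_tr, y_tr, rng)
        tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
        proba = tree.predict_proba(X[test])[:, 1]
        scores.append(average_precision_score(y[test], proba))
    label = "undersampled" if balance else "full data"
    print(f"{label:12s} mean PR-AUC = {np.mean(scores):.3f}")
```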
I took linear regression and a decision tree. Linear regression behaves as expected:

[plot: PR-AUC for linear regression, with and without undersampling]

But the decision tree seems to work almost perfectly:

[plot: PR-AUC for the decision tree, with and without undersampling]

We get very high precision and recall, and undersampling makes them worse. Why is there such a big difference between the two models?
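For clarity, this is how I score the two models (assumption: since plain `LinearRegression` has no `predict_proba`, I use its continuous output directly as the ranking score for the PR curve):

```python
# How the two models are compared on one split (assumption: linear
# regression's raw predictions serve as the ranking score, since
# LinearRegression has no predict_proba).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import average_precision_score

X, y = make_classification(n_samples=50_000, n_features=20,
                           weights=[0.998], flip_y=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

lin = LinearRegression().fit(X_tr, y_tr)
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

print("linear regression PR-AUC:",
      average_precision_score(y_te, lin.predict(X_te)))
print("decision tree PR-AUC:    ",
      average_precision_score(y_te, tree.predict_proba(X_te)[:, 1]))
```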
Upvotes: 1
Views: 6739
Reputation: 181
First, generally speaking, a simple decision tree model cannot handle an imbalanced problem very well. The performance of the model is strongly tied to the actual class distribution of the data set.
There are several situations that can make a decision tree model appear to solve an imbalanced problem; check whether the situation you mention in the question matches one of the following:
- If you use ensemble learning, the model will perform well, but then it is no longer a plain decision tree; it is a random forest (RF) or gradient-boosted trees (GBDT). See the sketch after this list.
- For simple linear classifiers such as logistic regression, performance is almost certainly poor on an imbalanced problem. During training, the model looks for a hyperplane that minimizes misclassification, and with a heavily skewed class distribution it ends up assigning nearly all samples to the majority label.
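A rough sketch contrasting these two situations on synthetic imbalanced data (the dataset and exact numbers are illustrative; on well-separated data logistic regression may still do fine):

```python
# Illustrative comparison on synthetic imbalanced data: a plain tree,
# a random forest ensemble, and logistic regression. At the default
# 0.5 threshold, logistic regression often assigns nearly everything
# to the majority class; exact numbers depend on the data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=50_000, n_features=20,
                           weights=[0.998], flip_y=0, class_sep=0.5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "decision tree      ": DecisionTreeClassifier(random_state=0),
    "random forest      ": RandomForestClassifier(n_estimators=200,
                                                  random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # Minority-class recall at the default decision threshold.
    print(name, "minority recall =",
          round(recall_score(y_te, model.predict(X_te)), 3))
```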
Upvotes: 3