Reputation: 301
While reading about Decision Trees in scikit-learn, I found this:
Balance your dataset before training to prevent the tree from being biased toward the classes that are dominant. Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value.
(From: http://scikit-learn.org/stable/modules/tree.html)

I am confused about two points.
(1)
Class balancing can be done by sampling an equal number of samples from each class
If I do it this way, should I also set a proper sample weight for each sample in each class (or a class weight)?
For example, suppose I have two classes, A and B, with sample counts:

A: 100, B: 10000

Can I input 10000 samples for each class and set the weights like this:

input samples of A: 10000, input samples of B: 10000
weight of A: 0.01, weight of B: 1.0
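
In code, what I am asking about in (1) would look something like this (toy X and y just for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: 100 samples of class A (label 0), 10000 of class B (label 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(10100, 4))
y = np.array([0] * 100 + [1] * 10000)

# Oversample class A with replacement so both classes have 10000 samples.
idx_a = rng.choice(np.flatnonzero(y == 0), size=10000, replace=True)
idx_b = np.flatnonzero(y == 1)
idx = np.concatenate([idx_a, idx_b])
X_bal, y_bal = X[idx], y[idx]

# ...and then also down-weight the oversampled A samples to 0.01?
w = np.where(y_bal == 0, 0.01, 1.0)
clf = DecisionTreeClassifier(random_state=0).fit(X_bal, y_bal, sample_weight=w)
```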
(2)
But the documentation also says:
preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value
I am totally confused by this. Does it mean I should input 100 samples of A and 10000 samples of B and then set the weights like this:

input samples of A: 100, input samples of B: 10000
weight of A: 1.0, weight of B: 1.0
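
In code, my reading of (2) would be something like this (same toy data as above):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data again: 100 samples of class A (0), 10000 of class B (1).
rng = np.random.default_rng(0)
X = rng.normal(size=(10100, 4))
y = np.array([0] * 100 + [1] * 10000)

# Feed the raw, imbalanced data and give every sample weight 1.0,
# so the per-class weight sums are 100 (A) vs 10000 (B).
w = np.ones(len(y))
clf = DecisionTreeClassifier(random_state=0).fit(X, y, sample_weight=w)
```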
But then it seems I have done nothing to balance the imbalanced data.
Which way is better, and what does the second way mean in scikit-learn? Can anyone help me clarify this?
Upvotes: 1
Views: 1813
Reputation: 3554
There are many ways to balance the dataset; what the documentation recommends is to make

weight * number of observations

equal for both the under-represented and the over-represented groups.
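As a minimal sketch of what that looks like (toy data; compute_sample_weight is scikit-learn's built-in helper for exactly this computation):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Toy data: 100 samples of class A (label 0), 10000 of class B (label 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(10100, 4))
y = np.array([0] * 100 + [1] * 10000)

# "balanced" gives each sample n_samples / (n_classes * count_of_its_class),
# so weight * number of observations comes out equal for every class:
#   A: 10100 / (2 * 100)   = 50.5  each -> sum 5050
#   B: 10100 / (2 * 10000) = 0.505 each -> sum 5050
w = compute_sample_weight(class_weight="balanced", y=y)
clf = DecisionTreeClassifier(random_state=0).fit(X, y, sample_weight=w)
```

Equivalently, DecisionTreeClassifier(class_weight="balanced") applies the same reweighting internally, so you do not need to resample anything.

Upvotes: 1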