Reputation: 301
While reading about Decision Trees in scikit-learn, I found this:
Balance your dataset before training to prevent the tree from being biased toward the classes that are dominant. Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value.
(From: http://scikit-learn.org/stable/modules/tree.html)

I am confused about two points.
(1)
Class balancing can be done by sampling an equal number of samples from each class
If I do it this way, should I also set a proper sample weight for each sample in each class (or a class weight)?
For example, suppose I have two classes, A and B, with sample counts:

A: 100, B: 10000

Can I input 10000 samples for each class and set the weights like this:

input samples of A: 10000, input samples of B: 10000
weight of A: 0.01, weight of B: 1.0
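
In code, what I am asking about in (1) would look something like this (toy X and y just for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: 100 samples of class A (label 0), 10000 of class B (label 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(10100, 4))
y = np.array([0] * 100 + [1] * 10000)

# Oversample class A with replacement so both classes have 10000 samples.
idx_a = rng.choice(np.flatnonzero(y == 0), size=10000, replace=True)
idx_b = np.flatnonzero(y == 1)
idx = np.concatenate([idx_a, idx_b])
X_bal, y_bal = X[idx], y[idx]

# ...and then also down-weight the oversampled A samples to 0.01?
w = np.where(y_bal == 0, 0.01, 1.0)
clf = DecisionTreeClassifier(random_state=0).fit(X_bal, y_bal, sample_weight=w)
```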
(2)
But the documentation also says:
preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value
I am totally confused by this. Does it mean I should input 100 samples of A and 10000 samples of B and then set the weights like this:

input samples of A: 100, input samples of B: 10000
weight of A: 1.0, weight of B: 1.0
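
In code, my reading of (2) would be something like this (same toy data as above):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data again: 100 samples of class A (0), 10000 of class B (1).
rng = np.random.default_rng(0)
X = rng.normal(size=(10100, 4))
y = np.array([0] * 100 + [1] * 10000)

# Feed the raw, imbalanced data and give every sample weight 1.0,
# so the per-class weight sums are 100 (A) vs 10000 (B).
w = np.ones(len(y))
clf = DecisionTreeClassifier(random_state=0).fit(X, y, sample_weight=w)
```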
But then it seems I have done nothing to balance the imbalanced data.
Which way is better, and what does the second way mean in scikit-learn? Can anyone help me clarify this?
Upvotes: 1
Views: 1813
Reputation: 3554
There are many ways to balance the dataset; what the documentation recommends is to make

weight * number of observations

equal for both the under-represented and the over-represented groups.
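As a minimal sketch of what that looks like (toy data; compute_sample_weight is scikit-learn's built-in helper for exactly this computation):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Toy data: 100 samples of class A (label 0), 10000 of class B (label 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(10100, 4))
y = np.array([0] * 100 + [1] * 10000)

# "balanced" gives each sample n_samples / (n_classes * count_of_its_class),
# so weight * number of observations comes out equal for every class:
#   A: 10100 / (2 * 100)   = 50.5  each -> sum 5050
#   B: 10100 / (2 * 10000) = 0.505 each -> sum 5050
w = compute_sample_weight(class_weight="balanced", y=y)
clf = DecisionTreeClassifier(random_state=0).fit(X, y, sample_weight=w)
```

Equivalently, DecisionTreeClassifier(class_weight="balanced") applies the same reweighting internally, so you do not need to resample anything.

Upvotes: 1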