insomnia

Reputation: 301

How to balance imbalanced data

While reading the decision tree documentation in scikit-learn, I found this:

Balance your dataset before training to prevent the tree from being biased toward the classes that are dominant. Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value.

Source: http://scikit-learn.org/stable/modules/tree.html

I am confused.

(1)

Class balancing can be done by sampling an equal number of samples from each class

If I do this, should I also set a proper sample weight for each sample in each class (or add class sample...)?

For example, suppose I have two classes, A and B, with these sample counts:

A:100 B:10000

Can I input 10000 samples for each class and set the weights like this?

input samples of A:10000, input samples of B:10000

weight of A: 0.01, weight of B: 1.0

(2)

But the documentation also says:

preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value

I am totally confused by this. Does it mean I should input 100 samples of A and 10000 samples of B and then set the weights like this?

input samples of A:100, input samples of B:10000

weight of A: 1.0, weight of B: 1.0

But then it seems I have done nothing to balance the imbalanced data.

Which way is better, and what does the second way in the scikit-learn documentation mean? Can anyone help clarify it?

Upvotes: 1

Views: 1813

Answers (1)

abhiieor

Reputation: 3554

There are many ways to balance the dataset:

  1. Oversampling (draw more samples, with replacement) from the underrepresented class
  2. Undersampling (draw fewer samples, with or without replacement) from the overrepresented class
  3. Neighborhood-based synthetic data for the underrepresented class (search for SMOTE)
  4. Weight-based methods: the weights need tuning, but as a rough starting point you can choose weights that make weight * number of observations equal for the underrepresented and overrepresented groups. With A:100 and B:10000, for example, a weight of 100 per A sample and 1 per B sample gives both classes a total weight of 10000.
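
As a rough illustration of options 1 and 4, here is a minimal sketch in plain Python using the A:100 / B:10000 counts from the question. The helper names `balanced_sample_weights` and `oversample_minority` are made up for this example and are not part of scikit-learn. The weight formula, n_samples / (n_classes * class_count), is the same heuristic scikit-learn uses for its "balanced" class weights. Other scalings work too, as long as each class ends up with the same total weight.

```python
import random
from collections import Counter


def balanced_sample_weights(labels):
    """Return one weight per sample so that every class has the same total weight."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    # Heuristic: n_samples / (n_classes * class_count); rarer classes get larger weights.
    class_weight = {c: n_samples / (n_classes * k) for c, k in counts.items()}
    return [class_weight[y] for y in labels]


def oversample_minority(labels, seed=0):
    """Return sample indices in which smaller classes are repeated
    (drawn with replacement) until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    target = max(len(idx) for idx in by_class.values())
    indices = []
    for idx in by_class.values():
        indices.extend(idx)  # keep every original sample once
        indices.extend(rng.choices(idx, k=target - len(idx)))  # pad with repeats
    return indices


labels = ["A"] * 100 + ["B"] * 10000

# Option 4: keep all 10100 samples, but weight them so both classes sum equally.
weights = balanced_sample_weights(labels)
total_a = sum(w for w, y in zip(weights, labels) if y == "A")
total_b = sum(w for w, y in zip(weights, labels) if y == "B")

# Option 1: repeat A samples until both classes have 10000 rows.
resampled = oversample_minority(labels)
resampled_counts = Counter(labels[i] for i in resampled)
```

With scikit-learn, a weight list like this can be passed as `sample_weight` to `DecisionTreeClassifier.fit(X, y, sample_weight=weights)`, or you can set `class_weight="balanced"` on the estimator and skip computing the weights yourself. Use either weighting or resampling, not both at once, otherwise the minority class is compensated twice.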

Upvotes: 1
