Dail

Reputation: 4606

Upsampling an imbalanced dataset's minority classes

I am using scikit-learn to classify my data; at the moment I am running a simple DecisionTreeClassifier. I have three classes with a big imbalance problem. The classes are 0, 1 and 2, and the minority classes are 1 and 2.

To give you an idea of the number of samples per class:

0 = 25,000 samples
1 = roughly 15-20 samples
2 = roughly 15-20 samples

so the minority classes are about 0.06% of the dataset. The approach I am following to deal with the imbalance is UPSAMPLING the minority classes. Code:

from sklearn.utils import resample
resample(data, replace=True, n_samples=len_major_class, random_state=1234)
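
Concretely, I resample each minority class up to the size of class 0, roughly like this (the DataFrame df and the label column 'target' are illustrative names):

from sklearn.utils import resample
import pandas as pd

df_major = df[df['target'] == 0]    # ~25,000 samples
df_minor_1 = df[df['target'] == 1]  # ~15-20 samples
df_minor_2 = df[df['target'] == 2]  # ~15-20 samples

# draw with replacement until each minority class matches the majority size
df_minor_1_up = resample(df_minor_1, replace=True,
                         n_samples=len(df_major), random_state=1234)
df_minor_2_up = resample(df_minor_2, replace=True,
                         n_samples=len(df_major), random_state=1234)

df_upsampled = pd.concat([df_major, df_minor_1_up, df_minor_2_up])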

Now comes the problem. I did two tests:

  1. If I upsample the minority classes and then split my dataset into two groups, one for training and one for testing, the classification report is:
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     20570
          1       1.00      1.00      1.00     20533
          2       1.00      1.00      1.00     20439

avg / total       1.00      1.00      1.00     61542

A very good result.

  2. If I upsample ONLY the training data and leave the original data for testing, the result is:
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     20570
          1       0.00      0.00      0.00        15
          2       0.00      0.00      0.00        16

avg / total       1.00      1.00      1.00     20601

As you can see, the global accuracy is high, but precision and recall for classes 1 and 2 are zero.

I am creating the classifier in this way:

DecisionTreeClassifier(max_depth=20, max_features=0.4, random_state=1234, criterion='entropy')

I have also tried setting class_weight='balanced', but it makes no difference.

Since I should only upsample the training data, why am I getting this strange result?

Upvotes: 2

Views: 2191

Answers (3)

E.G. Cortes

Reputation: 75

I have a function that resamples the dataset so that each class ends up with the same number of instances:

from sklearn.utils import resample
import pandas as pd

def make_resample(_df, column):
    dfs_r = {}
    dfs_c = {}
    bigger = 0
    ignore = ""
    # split the frame per class and remember the largest class
    for c in _df[column].unique():
        dfs_c[c] = _df[_df[column] == c]
        if dfs_c[c].shape[0] > bigger:
            bigger = dfs_c[c].shape[0]
            ignore = c

    # for every other class, draw (with replacement) just enough extra
    # samples to reach the size of the largest class
    for c in dfs_c:
        if c == ignore:
            continue
        dfs_r[c] = resample(dfs_c[c],
                            replace=True,
                            n_samples=bigger - dfs_c[c].shape[0],
                            random_state=0)
    # append the extra samples to the original frame
    return pd.concat([dfs_r[c] for c in dfs_r] + [_df])
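
For example, assuming a DataFrame df with a label column named 'label':

df_balanced = make_resample(df, 'label')
print(df_balanced['label'].value_counts())  # every class now has the same count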

Upvotes: 0

Roberto

Reputation: 755

Obtaining that behavior is quite normal when you resample before splitting: you are introducing a bias into your data.

If you oversample the data and then split, the minority samples in the test set are no longer independent of the samples in the training set, because they were generated together; in your case they are exact copies of samples in the training set. Your accuracy is 100% because the classifier is classifying samples it has already seen during training.

Since your problem is strongly imbalanced, I would suggest using an ensemble of classifiers to handle it (see the sketch after this list):

  1. Split your dataset into a training set and a test set. Given the size of the dataset, you can put 1-2 samples from each minority class in the test set and leave the rest for training.

  2. From the training set, generate N datasets, each containing all the remaining minority-class samples and an under-sample of the majority class (I would say 2x the number of minority-class samples).

  3. Train one model on each of the datasets obtained.

  4. Use the test set to obtain the predictions; the final prediction is the result of a majority vote over the predictions of all the classifiers.

To get robust metrics, perform several iterations with different initial test/training splits.
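
A minimal sketch of this scheme, assuming a pandas DataFrame with non-negative integer labels in a 'target' column (all names and the 10-model / 2x ratio choices are illustrative):

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def train_ensemble(train, label_col='target', n_models=10, ratio=2, seed=0):
    rng = np.random.RandomState(seed)
    minority = train[train[label_col] != 0]  # classes 1 and 2
    majority = train[train[label_col] == 0]
    models = []
    for _ in range(n_models):
        # under-sample the majority class to ratio * the minority size
        maj = majority.sample(n=ratio * len(minority), random_state=rng)
        subset = pd.concat([minority, maj])
        X, y = subset.drop(columns=[label_col]), subset[label_col]
        models.append(DecisionTreeClassifier(random_state=0).fit(X, y))
    return models

def vote(models, X_test):
    preds = np.stack([m.predict(X_test) for m in models])  # (n_models, n_samples)
    # majority vote per test sample (labels are non-negative integers)
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, preds)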

Upvotes: 7

Venkatachalam

Reputation: 16966

You should not split the dataset after upsampling; do the upsampling only within the training data.

Basically, you are leaking the test data into the training data.
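
In other words, split first and upsample afterwards. A minimal sketch (df and the 'target' column are illustrative names):

from sklearn.model_selection import train_test_split
from sklearn.utils import resample
import pandas as pd

# split FIRST, on the original imbalanced data
train, test = train_test_split(df, test_size=0.2,
                               stratify=df['target'], random_state=1234)

# upsample only within the training split
n_major = (train['target'] == 0).sum()
parts = [train[train['target'] == 0]]
for c in (1, 2):
    parts.append(resample(train[train['target'] == c], replace=True,
                          n_samples=n_major, random_state=1234))
train_up = pd.concat(parts)
# the test set keeps the original class distribution and shares no rows
# with the upsampled training data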

Upvotes: 2
