John Doe

Reputation: 677

Steps for a highly imbalanced classification problem: should I up-sample and under-sample the data, or just up-sample the minority class?

I have a highly imbalanced binary (yes/no) classification dataset. Currently only approximately 0.008% of the rows are 'yes'.

I need to balance the dataset using SMOTE.

I came across 2 methods to deal with the imbalance. The first applies the following steps after I have run MinMaxScaler on the variables:

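from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler

# Oversample 'yes' to 10% of the 'no' count, then undersample 'no' until yes:no is 1:2
oversample = SMOTE(sampling_strategy=0.1, random_state=42)
undersample = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
steps = [('o', oversample), ('u', undersample)]
pipeline = Pipeline(steps=steps)
x_scaled_s, y_s = pipeline.fit_resample(X_scaled, y)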

This reduces the dataset from 2.4 million rows to 732,000 rows, and the share of 'yes' rows improves from 0.008% to 33.33%.

The second approach is:

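from imblearn.over_sampling import SMOTE
# fit_resample is the current method name; fit_sample is the older alias
sm = SMOTE(random_state=42)
X_sm, y_sm = sm.fit_resample(X_scaled, y)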

This increases the number of rows from 2.4 million to 4.8 million, and 'yes' now makes up 50% of the data.
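For reference, the row counts produced by each approach can be worked out by hand. The sketch below is plain arithmetic, not a call into imbalanced-learn: it assumes that a float sampling_strategy means "minority count divided by majority count after resampling", and the helper names are made up for illustration. Starting from exactly 0.008% of 2.4 million rows, it gives roughly 720,000 rows for the first approach, so the 732,000 reported above suggests the 0.008% figure is rounded.

```python
# Hypothetical helpers: plain arithmetic, no resampling is actually performed.

def after_oversample(n_major, n_minor, ratio):
    """Grow the minority class until n_minor / n_major reaches `ratio`."""
    return n_major, max(n_minor, int(n_major * ratio))


def after_undersample(n_major, n_minor, ratio):
    """Shrink the majority class until n_minor / n_major reaches `ratio`."""
    return min(n_major, int(n_minor / ratio)), n_minor


n_rows = 2_400_000
n_minor = round(n_rows * 0.008 / 100)   # assumes the 0.008% figure is exact
n_major = n_rows - n_minor

# Approach 1: SMOTE(sampling_strategy=0.1), then RandomUnderSampler(sampling_strategy=0.5)
major_1, minor_1 = after_undersample(*after_oversample(n_major, n_minor, 0.1), 0.5)
total_1 = major_1 + minor_1
share_1 = minor_1 / total_1

# Approach 2: SMOTE with default settings, i.e. a 1:1 ratio
major_2, minor_2 = after_oversample(n_major, n_minor, 1.0)
total_2 = major_2 + minor_2
share_2 = minor_2 / total_2

print(f"approach 1: {total_1:,} rows, {share_1:.2%} 'yes'")
print(f"approach 2: {total_2:,} rows, {share_2:.2%} 'yes'")
```

In both cases almost every 'yes' row in the resampled data is synthetic, which is worth keeping in mind when choosing between them.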

After these steps I need to split data into Train Test datasets....

What is the right approach here?

What factors do I need to consider before I choose any of these methods?

Should I evaluate on X_test, y_test taken from unsampled data? That would mean I split the data first and do the upsampling/undersampling only on the train data.

Thank you.

JD

Upvotes: 0

Views: 137

Answers (1)

maya-ami

Reputation: 410

After these steps I need to split data into Train Test datasets....

No! Any resampling technique should be applied only to the train set. This ensures that the test set keeps the real class distribution, so the performance measured on it is a good estimate of your model's generalization ability. If the resampling is performed on the whole dataset before splitting, synthetic SMOTE samples interpolated from rows that later land in the test set end up in the train set, and the test set itself no longer reflects the real imbalance, so the measured performance is going to be overly optimistic.

Steps:

  1. Split your dataset into train and test sets.
  2. Upsample/undersample only the train set.
  3. Train your model on the resampled train set.
  4. Estimate the performance on your untouched test set.

Upvotes: 1
