Reputation: 677
I have a highly imbalanced binary (yes/no) classification dataset. The dataset currently has approximately 0.008% 'yes' labels.
I need to balance the dataset using SMOTE.
I came across two methods to deal with the imbalance. The following steps are run after I have applied MinMaxScaler to the variables:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# oversample the minority class up to a 0.1 minority:majority ratio,
# then undersample the majority class down to a 0.5 ratio
oversample = SMOTE(sampling_strategy=0.1, random_state=42)
undersample = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
steps = [('o', oversample), ('u', undersample)]
pipeline = Pipeline(steps=steps)
x_scaled_s, y_s = pipeline.fit_resample(X_scaled, y)
This reduces the size of the dataset from 2.4 million rows to 732,000 rows, and the minority share improves from 0.008% to 33.33%.
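You can sanity-check the class counts before and after running the pipeline, for example with collections.Counter (variable names match the snippet above):

from collections import Counter
print('before:', Counter(y))    # original 'no'/'yes' counts
print('after :', Counter(y_s))  # 'yes' should now be roughly one third of the rows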
The second approach is:
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)  # default sampling_strategy oversamples the minority up to a 1:1 ratio
X_sm, y_sm = sm.fit_resample(X_scaled, y)
This increases the number of rows from 2.4 million to 4.8 million, and the minority share is now 50%.
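As a rough back-of-the-envelope check of those numbers (a sketch assuming roughly 2.4 million rows with 0.008% 'yes'; the stated figures suggest the starting counts are approximate):

n_total = 2_400_000
n_pos = round(n_total * 0.00008)        # ~192 'yes' rows at 0.008%
n_neg = n_total - n_pos

# Approach 1: SMOTE to a 0.1 minority:majority ratio, then undersample to 0.5
n_pos_smote = round(0.1 * n_neg)        # ~240,000 'yes' rows (real + synthetic)
n_neg_under = round(n_pos_smote / 0.5)  # ~480,000 'no' rows
print(n_pos_smote + n_neg_under)        # ~720,000 rows; minority share = 0.5 / 1.5 = 33.33%

# Approach 2: default SMOTE balances the classes 1:1
print(2 * n_neg)                        # ~4.8 million rows; minority share = 50%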
After these steps I need to split the data into train/test datasets....
What is the right approach here?
What factors do I need to consider before I choose any of these methods?
Should X_test and y_test come from unsampled data? That would mean I split the data first and do oversampling/undersampling only on the training data.
Thank you.
JD
Upvotes: 0
Views: 137
Reputation: 410
After these steps I need to split the data into train/test datasets....
No! Any resampling technique should be applied only to the training set. This ensures that the test set reflects reality, and the performance measured on it will be a good estimate of your model's ability to generalize. If the resampling is performed on the whole dataset before splitting, synthetic samples generated from test observations leak into the training set, and the reported performance will be overly optimistic.
Steps:
1. Split the data into train and test sets (stratify on the target so the rare 'yes' class appears in both).
2. Apply SMOTE / undersampling to the training set only.
3. Fit the model on the resampled training data.
4. Evaluate on the untouched test set.
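A minimal sketch of that ordering, assuming scikit-learn's train_test_split and reusing the SMOTE / RandomUnderSampler pipeline and the X_scaled / y names from the question:

from collections import Counter
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# 1. Split first; stratify keeps the rare 'yes' class present in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, stratify=y, random_state=42)

# 2. Resample the training set only
resampler = Pipeline([
    ('o', SMOTE(sampling_strategy=0.1, random_state=42)),
    ('u', RandomUnderSampler(sampling_strategy=0.5, random_state=42)),
])
X_train_res, y_train_res = resampler.fit_resample(X_train, y_train)
print(Counter(y_train_res))

# 3. Fit the model on X_train_res / y_train_res, then
# 4. evaluate it on the untouched X_test / y_test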
Upvotes: 1