sarika

Reputation: 49

Undersampling for imbalanced data after train/test split

I am working on a project with imbalanced data and want to balance it using random undersampling. I am unsure whether I should undersample after the train/test split, or undersample first and then do the train/test split.

My approach :

  1. I used a train/test split to get X_train and y_train for training, and X_test and y_test for testing.
  2. I combined X_train and y_train into one dataset and undersampled it.
  3. After undersampling, I performed cross validation and model selection based on the F1 score, using X_test and y_test for prediction.

Is my approach correct? Please correct me if I am wrong.

Upvotes: 4

Views: 4132

Answers (1)

maya-ami

Reputation: 410

Let's go through your approach:

I used train test split to get : X_train, y_train for training and X_test and y_test for testing. I combined X_train and y_train into one data set and did the undersampling.

That's right. Any resampling technique should be applied only to the train set, so that the test set still reflects the real class distribution. The performance measured on such a test set is then a good estimate of your model's generalization ability. If the resampling is performed on the whole dataset, the measured performance is going to be overly optimistic.
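
As a rough illustration of "split first, then resample only the train set", here is a minimal sketch. It assumes NumPy and scikit-learn are available; the synthetic data and the `random_undersample` helper are hypothetical and only there to make the example runnable, they are not from the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def random_undersample(X, y, rng):
    """Hypothetical helper: randomly drop rows so that every class
    keeps as many samples as the rarest class."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]


# Synthetic imbalanced data (assumption): 900 negatives, 100 positives.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = np.array([0] * 900 + [1] * 100)

# Split first; stratify so both parts keep the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Resample the train set only; X_test and y_test are left untouched.
X_train_bal, y_train_bal = random_undersample(X_train, y_train, rng)

print("train before:", np.bincount(y_train))
print("train after: ", np.bincount(y_train_bal))
print("test:        ", np.bincount(y_test))
```

The test set keeps its imbalance on purpose, since that is what the model will face in practice.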

After undersampling, I performed cross validation and model selection based on F1

It's difficult to tell exactly what was done without the code, but it seems you ran the cross validation on train data that had already been resampled. That's wrong: the undersampling should be applied inside each cross-validation iteration, to the training folds only, so that the held-out fold keeps the original class distribution. Let's consider 3-fold CV the way it should be done:

  1. The train set is divided into 3 folds: 2 folds are used for training and 1 for testing.
  2. You apply resampling to these 2 folds, train your model, and then estimate the performance on the untouched fold.
  3. Repeat steps 1-2 until each fold has been used as the test fold.
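
The 3-fold procedure above could be sketched as follows. This is only an illustration under assumptions: it uses scikit-learn's `StratifiedKFold`, a `LogisticRegression` chosen arbitrarily as the model, synthetic data standing in for the train set, and hypothetical helpers `random_undersample` and `cv_f1_undersampled`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold


def random_undersample(X, y, rng):
    """Hypothetical helper: shrink every class to the rarest class's size."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]


def cv_f1_undersampled(model, X, y, n_splits=3, seed=0):
    """F1 per fold; only the training folds are undersampled."""
    rng = np.random.default_rng(seed)
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for fit_idx, held_out_idx in folds.split(X, y):
        # Resample the 2 training folds of this iteration.
        X_fit, y_fit = random_undersample(X[fit_idx], y[fit_idx], rng)
        fold_model = clone(model).fit(X_fit, y_fit)
        # Score on the held-out fold with its original class ratio.
        y_pred = fold_model.predict(X[held_out_idx])
        scores.append(f1_score(y[held_out_idx], y_pred))
    return np.array(scores)


# Synthetic stand-in for the train set (assumption): 720 vs 80 samples.
data_rng = np.random.default_rng(0)
y_train = np.array([0] * 720 + [1] * 80)
X_train = data_rng.normal(size=(800, 5))
X_train[y_train == 1] += 1.0  # give the minority class some signal

scores = cv_f1_undersampled(LogisticRegression(max_iter=1000), X_train, y_train)
print("F1 per fold:", scores, "mean:", scores.mean())
```

If the imbalanced-learn package is an option, placing its `RandomUnderSampler` inside an `imblearn.pipeline.Pipeline` and passing that pipeline to scikit-learn's cross-validation utilities should give the same per-fold behaviour without a manual loop.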

Thus, what you should do is:

  1. Split the data into train and test sets.
  2. Perform CV on your train set, applying undersampling only to the training folds of each iteration, never to the held-out fold.
  3. After the model has been chosen with the help of CV, undersample the whole train set and train the classifier.
  4. Estimate the performance on the untouched test set.
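
Put together, the four steps might look like the sketch below, in which undersampling during CV touches only the training folds. Everything specific here is an assumption made for illustration: the synthetic data, the two candidate models, and the hypothetical helpers `random_undersample` and `cv_f1`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier


def random_undersample(X, y, rng):
    """Hypothetical helper: shrink every class to the rarest class's size."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]


def cv_f1(model, X, y, rng, n_splits=3):
    """Mean F1 over folds, undersampling the training folds only."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for fit_idx, held_out_idx in folds.split(X, y):
        X_fit, y_fit = random_undersample(X[fit_idx], y[fit_idx], rng)
        fitted = clone(model).fit(X_fit, y_fit)
        scores.append(f1_score(y[held_out_idx], fitted.predict(X[held_out_idx])))
    return float(np.mean(scores))


rng = np.random.default_rng(0)
y = np.array([0] * 900 + [1] * 100)   # synthetic labels (assumption)
X = rng.normal(size=(1000, 5))
X[y == 1] += 1.0                      # give the minority class some signal

# Step 1: split into train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 2: compare candidate models with CV on the train set.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}
cv_scores = {name: cv_f1(m, X_train, y_train, rng) for name, m in candidates.items()}
best_name = max(cv_scores, key=cv_scores.get)

# Step 3: undersample the whole train set and fit the chosen model.
X_bal, y_bal = random_undersample(X_train, y_train, rng)
final_model = clone(candidates[best_name]).fit(X_bal, y_bal)

# Step 4: evaluate once on the untouched test set.
test_f1 = f1_score(y_test, final_model.predict(X_test))
print(best_name, cv_scores, test_f1)
```

The test set is used exactly once, at the end, so it plays no part in choosing the model.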

Upvotes: 6
