Reputation: 145
I am using sklearn's GridSearchCV
to find the best parameters for a random forest applied to remote sensing data with 4 classes (buildings, vegetation, water and roads). The problem is that I have far more "vegetation" samples than any other class (thousands for the other classes versus several millions for vegetation). Should I balance my test dataset before computing the metrics?
I already balance the whole set before splitting into training and testing, so both datasets have the same class distribution. I am afraid this does not represent the algorithm's performance on real data, but it gives me an insight into the performance per class. If I use unbalanced data, the "vegetation" class might end up skewing the averages of the other classes.
Here's an example of the balancing I do; as you can see, I apply it to X and y directly, which are the full data and labels.
from imblearn.under_sampling import RandomUnderSampler

if balance:
    smt = RandomUnderSampler(sampling_strategy='auto')
    # fit_sample was renamed to fit_resample in newer imbalanced-learn releases
    X, y = smt.fit_resample(X, y)
    print("Features array shape after balance: " + str(X.shape))
I want the best possible understanding of the model's performance on real data, but I have not found conclusive answers on this!
Upvotes: 9
Views: 12833
Reputation: 486
The rule of thumb for dealing with imbalanced data is: never, ever balance the test data. The pipeline for dealing with imbalanced data:

1. Prepare your features and labels.
2. Split the data into training and test sets.
3. Balance the training set only.
4. Train the model on the balanced training set.
5. Evaluate on the untouched, imbalanced test set.

That way you will get the model's actual performance.
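A minimal sketch of that pipeline (reusing the X and y from the question; the split ratio and model parameters are arbitrary):

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from imblearn.under_sampling import RandomUnderSampler

# Step 2: split first; stratify so the test set keeps the real class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Step 3: balance the training set only
sampler = RandomUnderSampler(sampling_strategy='auto', random_state=42)
X_train_bal, y_train_bal = sampler.fit_resample(X_train, y_train)

# Step 4: train on the balanced training data
clf = RandomForestClassifier(random_state=42).fit(X_train_bal, y_train_bal)

# Step 5: evaluate on the untouched, imbalanced test set;
# the per-class rows show how each class does despite the imbalance
print(classification_report(y_test, clf.predict(X_test)))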
The question that arises here is: why not balance the data before the train/test split?
Because you can't expect real-world data to be balanced when you deploy the model, so the test set should keep the original class distribution.
A better way is to use K-fold cross-validation at step 2 and repeat steps 3, 4 and 5 for each fold, as sketched below.
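A sketch of the K-fold variant using imbalanced-learn's Pipeline, which applies the sampler only when fitting, so each validation fold keeps its original distribution (the parameter grid and scoring choice here are just examples):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler

pipe = Pipeline([
    ('sampler', RandomUnderSampler(sampling_strategy='auto', random_state=42)),
    ('rf', RandomForestClassifier(random_state=42)),
])

# The sampler resamples only the training portion of each fold;
# scores are computed on the untouched validation fold.
search = GridSearchCV(
    pipe,
    param_grid={'rf__n_estimators': [100, 300], 'rf__max_depth': [None, 20]},
    cv=StratifiedKFold(n_splits=5),
    scoring='f1_macro',  # macro average weights all 4 classes equally
)
search.fit(X, y)
print(search.best_params_, search.best_score_)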
Refer to this article for more info.
Upvotes: 20