Reputation: 145
I am using sklearn's GridSearchCV
to find the best parameters for a random forest applied to remote sensing data with 4 classes (buildings, vegetation, water and roads). The problem is that I have far more "vegetation" samples than any other class (thousands for the other classes versus several millions for vegetation). Should I balance my test dataset before computing the metrics?
I already balance the whole set before splitting into training and testing, so both datasets have the same class distribution. I am afraid this does not represent the algorithm's performance on real data, but it gives me an insight into the performance per class. If I use unbalanced data, the "vegetation" class might end up skewing the averages of the other classes.
Here's an example of the balancing I do; as you can see, I apply it to X and y directly, which are the full data and labels.
from imblearn.under_sampling import RandomUnderSampler

if balance:
    smt = RandomUnderSampler(sampling_strategy='auto')
    # fit_sample was renamed to fit_resample in newer imbalanced-learn releases
    X, y = smt.fit_resample(X, y)
    print("Features array shape after balance: " + str(X.shape))
I want the best possible understanding of the model's performance on real data, but I have not found conclusive answers on this!
Upvotes: 9
Views: 12833
Reputation: 486
The rule of thumb for dealing with imbalanced data is: never, ever balance the test data. The pipeline for dealing with imbalanced data:

1. Prepare your features and labels.
2. Split the data into training and test sets.
3. Balance the training set only.
4. Train the model on the balanced training set.
5. Evaluate on the untouched, imbalanced test set.

That way you will get the model's actual performance.
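A minimal sketch of that pipeline (reusing the X and y from the question; the split ratio and model parameters are arbitrary):

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from imblearn.under_sampling import RandomUnderSampler

# Step 2: split first; stratify so the test set keeps the real class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Step 3: balance the training set only
sampler = RandomUnderSampler(sampling_strategy='auto', random_state=42)
X_train_bal, y_train_bal = sampler.fit_resample(X_train, y_train)

# Step 4: train on the balanced training data
clf = RandomForestClassifier(random_state=42).fit(X_train_bal, y_train_bal)

# Step 5: evaluate on the untouched, imbalanced test set;
# the per-class rows show how each class does despite the imbalance
print(classification_report(y_test, clf.predict(X_test)))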
The question that arises here is: why not balance the data before the train/test split?
Because you can't expect real-world data to be balanced when you deploy the model, so the test set should keep the original class distribution.
A better way is to use K-fold cross-validation at step 2 and repeat steps 3, 4 and 5 for each fold, as sketched below.
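A sketch of the K-fold variant using imbalanced-learn's Pipeline, which applies the sampler only when fitting, so each validation fold keeps its original distribution (the parameter grid and scoring choice here are just examples):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler

pipe = Pipeline([
    ('sampler', RandomUnderSampler(sampling_strategy='auto', random_state=42)),
    ('rf', RandomForestClassifier(random_state=42)),
])

# The sampler resamples only the training portion of each fold;
# scores are computed on the untouched validation fold.
search = GridSearchCV(
    pipe,
    param_grid={'rf__n_estimators': [100, 300], 'rf__max_depth': [None, 20]},
    cv=StratifiedKFold(n_splits=5),
    scoring='f1_macro',  # macro average weights all 4 classes equally
)
search.fit(X, y)
print(search.best_params_, search.best_score_)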
Refer to this article for more info.
Upvotes: 20