Reputation: 465
I am training a machine learning model on a classification problem. My dataset has 10,000 observations spread across 37 classes, but it is imbalanced: some classes have only 100 observations while others have 3,000 or 4,000.
After searching for ways to handle this type of data and improve the algorithm's performance, I found two solutions: upsampling the minority classes or downsampling the majority classes.
With the first solution (upsampling), I have many classes with only a few observations, so it would require much more data and a long training time. That would be hard for me!
With the second one (downsampling), all classes would end up with only a few observations and the dataset would become very small, so it would be hard for the algorithm to generalize.
So is there another solution I can try for this problem?
Upvotes: 0
Views: 965
Reputation: 31
You could use a combination of both.
It sounds like you are worried about getting a dataset that is too large if you upsample all minority classes to match the majority classes. If this is the case, you can downsample the majority classes to something like 25% or 50%, and at the same time upsample the minority classes. An alternative to upsampling is synthesising samples for the minority classes using an algorithm like SMOTE.
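A minimal sketch of this combined approach, using only NumPy (the class sizes and the per-class target of 1,000 are hypothetical stand-ins for your data; in practice you could use `sklearn.utils.resample` or `imblearn.over_sampling.SMOTE` instead of raw index sampling):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy imbalanced dataset: class 0 has 3000 samples, class 1 has 100
# (stand-ins for a majority and a minority class from the question).
X = rng.normal(size=(3100, 5))
y = np.array([0] * 3000 + [1] * 100)

target = 1000  # hypothetical per-class target size

X_parts, y_parts = [], []
for cls in np.unique(y):
    idx = np.flatnonzero(y == cls)
    # Downsample large classes without replacement; upsample small
    # ones with replacement, so every class ends up at `target`.
    picked = rng.choice(idx, size=target, replace=len(idx) < target)
    X_parts.append(X[picked])
    y_parts.append(np.full(target, cls))

X_bal = np.vstack(X_parts)   # shape (2000, 5): 1000 samples per class
y_bal = np.concatenate(y_parts)
```

Choosing `target` between the minority and majority sizes is what limits both of the downsides you mention: the majority classes shrink, but the minority classes are not blown up to 3,000–4,000 samples each.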
If you are training a neural network in mini-batches, make sure the training set is properly shuffled so that minority and majority samples are roughly evenly distributed across the mini-batches.
Upvotes: 0
Reputation: 2129
You can change the weights in your loss function so that the smaller classes have larger importance when optimizing. With TensorFlow/Keras you can use tf.nn.weighted_cross_entropy_with_logits, or pass a class_weight dict to model.fit, for example.
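One common way to build such weights is inverse class frequency, normalized so they average to 1. A small sketch (the toy label array is hypothetical; the resulting dict matches the format Keras `Model.fit` accepts via its `class_weight` argument):

```python
import numpy as np

# Toy label array: class 0 dominates, class 1 is rare.
y = np.array([0] * 3000 + [1] * 100)

classes, counts = np.unique(y, return_counts=True)
# Inverse-frequency weights: rare classes get proportionally
# larger weights, so their errors count more in the loss.
weights = len(y) / (len(classes) * counts)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}

# With Keras this dict would then be passed as:
# model.fit(X, y, class_weight=class_weight)
```

Unlike resampling, this leaves the dataset untouched, so you avoid both the blow-up from upsampling and the information loss from downsampling.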
Upvotes: 1