Reputation: 69
Say I have 6 different categories I'm trying to classify my data into using a NN. How important is it for training that I have an equal number of instances for each class? Presently I have like 50k for one class, 6k for another, 300 for another... you get the picture. How big of a problem is this? I'm thinking I might nix some of the classes with low representation, but I'm not sure what a good cutoff would be, or if it would really matter.
Upvotes: 1
Views: 49
Reputation: 1163
Imbalanced data is generally a problem for machine learning, particularly when the classes are severely imbalanced, as in your case. In a nutshell, the algorithm won't be able to learn the right associations between the features and the categories for all classes. It will most likely miss the rules for the minority classes and/or rely too much on the majority class(es). Have a look at the imblearn package.
General solutions for imbalanced data are to either:
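Two commonly used options, sketched here as my own assumption rather than as a definitive list, are to weight each class by its inverse frequency in the loss, or to randomly oversample the minority classes until they match the largest one. The snippet below uses only the standard library; the helper names `balanced_class_weights` and `random_oversample` and the toy class counts are hypothetical, and imblearn offers ready-made resamplers if you would rather not roll your own.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class rows at random until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(rows) for rows in by_class.values())
    out_x, out_y = [], []
    for y, rows in by_class.items():
        # Draw with replacement from the class's own rows to fill the gap.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        out_x.extend(rows + extra)
        out_y.extend([y] * target)
    return out_x, out_y


# Toy data (assumed): three classes with a strong imbalance.
labels = ["a"] * 500 + ["b"] * 60 + ["c"] * 3
samples = list(range(len(labels)))

weights = balanced_class_weights(labels)            # rarer class -> larger weight
res_x, res_y = random_oversample(samples, labels)   # every class now has 500 rows
```

If you resample, do it on the training split only, so the validation and test sets still reflect the real class distribution.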
Other considerations include changing your performance metric, for example to per-class precision/recall rather than overall accuracy.
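To see why accuracy can mislead here, consider a model that always predicts the majority class. The sketch below is a minimal illustration with made-up labels; the function `per_class_precision_recall` is a hypothetical helper, not part of any library, and it reports 0.0 when a denominator is empty (an assumed convention).

```python
def per_class_precision_recall(y_true, y_pred):
    """Return {class: (precision, recall)} computed one class at a time."""
    classes = sorted(set(y_true) | set(y_pred))
    report = {}
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        # Assumed convention: report 0.0 when the denominator is empty.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        report[cls] = (precision, recall)
    return report


# Toy labels (assumed): 95 majority examples, 5 minority examples.
y_true = ["major"] * 95 + ["minor"] * 5
y_pred = ["major"] * 100  # a model that ignores the minority class entirely

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
report = per_class_precision_recall(y_true, y_pred)

print(accuracy)          # 0.95, which looks fine
print(report["minor"])   # (0.0, 0.0), the minority class is never found
```

Accuracy is high only because the majority class dominates; the per-class numbers show that the minority class is missed completely.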
This link should provide some further examples that might be helpful
Upvotes: 1