madan ram

Reputation: 1270

Does the number of data points per class matter for classification?

I have a sample training data set and want to know whether the number of data points in each class matters. Should I balance the data set across classes?

Upvotes: 0

Views: 72

Answers (1)

Pedrom

Reputation: 3833

The asymmetry in how classes are represented in the training data is usually called class imbalance or skewed classes (by loose analogy with the skewness of a distribution [https://en.wikipedia.org/wiki/Skewness]). It brings several problems for your model, so in general you would like to avoid it.
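To make one of those problems concrete, here is a minimal sketch that measures how skewed a label list is and shows how well a model that always predicts the most common class would score. The labels and the function names `class_distribution` and `majority_baseline_accuracy` are made up for illustration; they are not from any library.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: fraction of samples} for a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most common class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Hypothetical labels: 95 samples of class "a", 5 of class "b".
labels = ["a"] * 95 + ["b"] * 5
distribution = class_distribution(labels)
baseline = majority_baseline_accuracy(labels)
print(distribution)
print(baseline)
```

With a split like this, a classifier can reach a high accuracy while never predicting the rare class at all, which is why plain accuracy tends to be misleading on skewed data.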

That said, this is only a rule of thumb. You could have the happy case where the class with fewer data points is actually well represented and the extra points in the other classes are redundant. In that case the difference in the number of data points per class might not be critical.

The main problem is that it can be hard to tell a priori whether the data is balanced in terms of representation, so the best approach is to try to keep the number of data points per class balanced. Also, some algorithms are sensitive to asymmetric data, so even if the data does properly represent the space, the imbalance might introduce bias into the model.
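One simple way to keep the classes balanced is random oversampling: duplicate randomly chosen points from the smaller classes until every class is as large as the biggest one. The sketch below uses only the standard library. The helper `random_oversample` and the toy data are assumptions for illustration, and this is just one of several possible approaches (undersampling the large class or using class weights are others).

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate random minority-class samples until all classes match the largest."""
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    # Every class is grown to the size of the largest class.
    target = max(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw the missing points with replacement from the class itself.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical data: 95 points of class "a", 5 of class "b".
samples = list(range(100))
labels = ["a"] * 95 + ["b"] * 5
balanced_samples, balanced_labels = random_oversample(samples, labels)
print(Counter(balanced_labels))
```

If you try this, resample only the training split and leave the test set untouched, otherwise duplicated points can leak into the evaluation and inflate the scores.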

Here are some links that might be helpful:

http://people.stern.nyu.edu/fprovost/Papers/skew.PDF

http://etabeta.univ.trieste.it/dspace/bitstream/10077/4002/1/Menardi%20Torelli%20DEAMS%20WPS2.pdf

http://florianhartl.com/thoughts-on-machine-learning-dealing-with-skewed-classes.html

Upvotes: 1
