Reputation: 3029
I am curious about how sample size affects a classifier's performance in multi-label classification. I ran an experiment and noticed that for some classifiers, like Naive Bayes, the sample size doesn't seem to affect the accuracy score much.
My question is: why does sample size only affect some classifiers, like Decision Trees or SVM?
Upvotes: 1
Views: 613
Reputation: 66805
Actually the problem has nothing to do with the multi-label setting. It is true for any learning task: classification, regression, anything.
Sample size affects classifiers that are consistent (the ones that converge to the true underlying distribution given a large enough sample). In other words, it affects classifiers that are able to overfit: those with high variance and low bias.
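A quick way to see this empirically is to look at learning curves. Here is a minimal sketch (assuming scikit-learn; the synthetic dataset from `make_classification` and the two models are my choices for illustration, not from your experiment) comparing how cross-validated accuracy changes with training-set size for a high-bias model (Gaussian Naive Bayes) and a high-variance one (a fully grown decision tree):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with a non-trivial decision boundary.
X, y = make_classification(n_samples=5000, n_features=20,
                           n_informative=15, random_state=0)

for name, clf in [("Naive Bayes", GaussianNB()),
                  ("Decision Tree", DecisionTreeClassifier(random_state=0))]:
    # learning_curve refits the model on growing fractions of the data
    # and reports cross-validated test scores for each size.
    sizes, _, test_scores = learning_curve(
        clf, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)
    print(name, [round(s, 3) for s in test_scores.mean(axis=1)])
```

You should generally see the Naive Bayes scores flatten out after relatively few samples, while the tree's score keeps climbing as more data arrives.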
Naive Bayes will always model your distribution in a very simple way; it has an extremely strong bias, i.e. an assumption about the shape of your data. A similar argument applies to a linear SVM: it will reach some score and then stop improving even if you add more points. Simply put, the class of models they search over is too small to represent the actual relation (there is a concrete sketch of this after the analogy below). You can think of it in terms of teaching things to three kinds of animals, say bugs, dogs, and humans:
You teach them to avoid pain, and they all do it perfectly. Then you add new points (new data): now you teach them to "fetch", and the bugs fail, no matter how many times you show them how. They are simply not capable of doing so. Now you move on to teaching them to compute logarithms, and the dogs fail while the humans succeed (after being shown a large amount of data).
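To make the "bugs can't fetch" case concrete, here is a small sketch (scikit-learn again; the two-moons dataset and the sample sizes are arbitrary choices for illustration) where a linear SVM plateaus on data that no straight line can separate:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Two interleaved half-circles: no linear boundary fits them well.
X, y = make_moons(n_samples=20000, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n in (100, 1000, 10000):
    clf = LinearSVC(max_iter=10000).fit(X_train[:n], y_train[:n])
    print(n, round(clf.score(X_test, y_test), 3))
# The score stops improving early: the leftover error is bias,
# not lack of data.
```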
Now, if you use something like an SVM with an RBF kernel, it is known to be consistent: it will approximate any "well behaved" distribution. So if your problem is solvable and you give the model enough data, it will solve it nearly perfectly.
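And the contrast, under the same assumptions as the sketch above: an RBF-kernel SVM on the same two-moons data keeps benefiting from extra samples, because its hypothesis class is rich enough to represent the true boundary:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=20000, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n in (100, 1000, 10000):
    clf = SVC(kernel="rbf").fit(X_train[:n], y_train[:n])
    print(n, round(clf.score(X_test, y_test), 3))
# Unlike the linear model, accuracy keeps climbing as n grows,
# limited only by the noise in the data.
```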
Upvotes: 2