Reputation: 97
I'm experimenting with PCA and Naive Bayes Classifier in Python.
In short, using a database of gray-scale images of digits, I'm reducing dimensions with PCA and then using Naive Bayes to classify.
I use 2, 4, 10, 30, 60, 200, 500, and 784 components, and the corresponding classification error rates are 0.25806452, 0.15322581, 0.06290323, 0.06451613, 0.06451613, 0.10322581, 0.28064516, and 0.31774194. I thought that keeping more components always improved classification accuracy. Is that true? If so, I must be doing something wrong.
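For reference, here is a minimal sketch of my setup (I'm using scikit-learn's PCA and GaussianNB here, with fetch_openml loading an MNIST-style set as a stand-in; my actual data-loading code differs slightly):

```python
# Sketch of the pipeline described above, assuming scikit-learn.
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# MNIST-style digits: 784 features per gray-scale image.
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

for n in (2, 4, 10, 30, 60, 200, 500, 784):
    pca = PCA(n_components=n).fit(X_train)
    clf = GaussianNB().fit(pca.transform(X_train), y_train)
    error = 1 - clf.score(pca.transform(X_test), y_test)
    print(f"{n:>3} components: error rate {error:.4f}")
```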
Upvotes: 0
Views: 2030
Reputation: 2231
It is true that reducing dimensions reduces overfitting, but for a fixed dataset there is an optimal number of components that gives the best accuracy. In your case it is 10, since that gives the lowest error rate, 0.06290323. So if you increase the dimensionality, you should also increase the size of the training set to expect better accuracy. Otherwise, you can run a grid search around that value to fine-tune the number of components.
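For example, something like this (a rough sketch assuming scikit-learn's Pipeline and GridSearchCV, with the bundled digits set standing in for your data):

```python
# Grid search over the number of PCA components, assuming scikit-learn.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)  # stand-in for your own data

pipe = Pipeline([('pca', PCA()), ('nb', GaussianNB())])
# Search component counts around the best value you observed (10).
param_grid = {'pca__n_components': [6, 8, 10, 12, 15, 20, 25]}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```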
Also, if your dataset is balanced, then accuracy may be a good measure for evaluating performance. For an imbalanced dataset, try precision, recall, or F-score instead.
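Continuing the sketch above, classification_report gives per-class precision, recall, and F-score:

```python
# Per-class precision, recall, and F-score for the best pipeline found above.
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict

y_pred = cross_val_predict(search.best_estimator_, X, y, cv=5)
print(classification_report(y, y_pred))
```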
If you are still not satisfied with the classifier, try a different classification algorithm.
Upvotes: 1
Reputation: 1251
I don't think there is a single definitive answer to your question, but reducing the dimensionality of your input can prevent overfitting; more features do not always make your classifier more accurate. See here for a detailed explanation: http://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/
Upvotes: 1