Reputation: 4642
I am using the UCI ML breast cancer dataset to build a classifier with SVMs. I am using LIBSVM and its fselect.py script to compute F-scores for feature selection. My dataset has 8 features, and their scores are as follows:
5: 1.765716
2: 1.413180
1: 1.320096
6: 1.103449
8: 0.790712
3: 0.734230
7: 0.698571
4: 0.580819
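For reference, my understanding of what fselect.py reports for each feature is the F-score (Fisher criterion): the spread between the two class means divided by the spread within the classes. A rough MATLAB sketch (assuming y holds +1/-1 labels; adjust the tests to your actual label values):
% sketch of the per-feature F-score, as I understand fselect.py's definition
pos = x(y == +1, :);   neg = x(y == -1, :);
num = (mean(pos) - mean(x)).^2 + (mean(neg) - mean(x)).^2;   % between-class spread
den = var(pos) + var(neg);                                   % within-class spread (n-1 normalised)
fscore = num ./ den                                          % one score per column/feature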
This implies that the 5th feature is the most discriminative and the 4th the least. My next piece of code looks something like this:
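% feature subsets in decreasing f-score order (most discriminative first)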
x1=x(:,5);
x2=x(:,[5,2]);
x3=x(:,[5,2,6]);
x4=x(:,[5,2,6,8]);
x5=x(:,[5,2,6,8,3]);
x6=x(:,[5,2,6,8,3,7]);
x7=x(:,[5,2,6,8,3,7,4]);
errors2=zeros(7,1);
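% with '-v 10', svmtrain runs 10-fold cross-validation and returns the accuracy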
errors2(1)=svmtrain(y,x1,'-s 0 -t 2 -c 0.062500 -g 0.0039062 -v 10');
errors2(2)=svmtrain(y,x2,'-s 0 -t 2 -c 0.062500 -g 0.0039062 -v 10');
errors2(3)=svmtrain(y,x3,'-s 0 -t 2 -c 0.062500 -g 0.0039062 -v 10');
errors2(4)=svmtrain(y,x4,'-s 0 -t 2 -c 0.062500 -g 0.0039062 -v 10');
errors2(5)=svmtrain(y,x5,'-s 0 -t 2 -c 0.062500 -g 0.0039062 -v 10');
errors2(6)=svmtrain(y,x6,'-s 0 -t 2 -c 0.062500 -g 0.0039062 -v 10');
errors2(7)=svmtrain(y,x7,'-s 0 -t 2 -c 0.062500 -g 0.0039062 -v 10');
Note: C and gamma were chosen by grid search, and x is the complete matrix with 8 columns (one per feature).
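For completeness, the grid search was along these lines; a rough sketch, where the exponent ranges are just the usual suggestion from the libsvm guide, not necessarily the exact ones I used:
% grid search over C and gamma using 10-fold cross-validation accuracy
best_acc = 0;
for log2c = -5:15
    for log2g = -15:3
        opts = sprintf('-s 0 -t 2 -c %g -g %g -v 10', 2^log2c, 2^log2g);
        acc = svmtrain(y, x, opts);          % with -v, svmtrain returns CV accuracy
        if acc > best_acc
            best_acc = acc;  best_c = 2^log2c;  best_g = 2^log2g;
        end
    end
end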
When I print the errors2 matrix, I get the following output:
errors2 =
88.416
92.229
93.109
94.135
94.282
94.575
94.575
This means that I get the highest accuracy when I use all the features and the lowest accuracy when I use only the most discriminating feature. As far as I know, I should get the best accuracy from a subset of features that contains the most discriminating one. Why is the program behaving this way? Can someone point out any errors that I might have made? (My intuition says that I've calculated C wrong, since it is so small.)
Upvotes: 0
Views: 219
Reputation: 1131
The numbers you are getting are as would be expected: with -v, svmtrain reports cross-validation accuracy rather than an error rate, and adding an extra feature will usually improve it, because the classifier has more information to work with.
As an example, consider trying to work out what model a car is. The most discriminative feature is probably the manufacturer, but adding features such as engine size, height, width, length, weight, etc. will narrow it down further.
If you are considering lots of features, some of which may have very low discriminative power, you can run into problems with overfitting to your training data. Here you have just 8 features, but it already looks like adding the last, lowest-ranked feature has no effect on the cross-validation accuracy. (In the car example, these would be features such as how dirty the car is, the amount of tread left on the tyres, the channel the radio is tuned to, etc.)
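If you want a quick check of whether the weaker features are earning their keep, one option is a greedy forward selection over your ranked features, keeping a feature only when it improves the cross-validated accuracy. A sketch reusing the svmtrain call and parameters from your question (ideally C and gamma would be re-tuned for each subset):
rank = [5 2 1 6 8 3 7 4];         % features in decreasing f-score order
selected = [];  best_acc = 0;
for f = rank
    trial = [selected f];
    acc = svmtrain(y, x(:, trial), '-s 0 -t 2 -c 0.062500 -g 0.0039062 -v 10');
    if acc > best_acc             % keep the feature only if CV accuracy improves
        selected = trial;
        best_acc = acc;
    end
end
selected                          % the subset that survived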
Upvotes: 1