Prashant Pandey

Reputation: 4642

Feature selection in SVM classification-Weird behaviour

I am using the UCI ML breast cancer dataset to build a classifier with SVMs. I am using LIBSVM and its fselect.py script to calculate F-scores for feature selection. My dataset has 8 features, and their scores are as follows:

5:  1.765716
2:  1.413180
1:  1.320096
6:  1.103449
8:  0.790712
3:  0.734230
7:  0.698571
4:  0.580819

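For reference, fselect.py ranks features with a Fisher-style F-score (larger means the feature separates the two classes better). A minimal MATLAB sketch of that computation, assuming y holds +1/-1 labels and x is the full n-by-8 feature matrix, might look like this (a sketch only, not the fselect.py source):

pos = (y == 1);                          % rows belonging to the positive class
neg = (y == -1);                         % rows belonging to the negative class
fscore = zeros(1, size(x,2));
for i = 1:size(x,2)
    num = (mean(x(pos,i)) - mean(x(:,i)))^2 + (mean(x(neg,i)) - mean(x(:,i)))^2;
    den = var(x(pos,i)) + var(x(neg,i)); % var() uses the 1/(n-1) normalisation
    fscore(i) = num / den;               % larger F-score = more discriminative feature
end
[~, order] = sort(fscore, 'descend');    % feature indices ranked by F-score
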
This implies that the 5th feature is the most discriminative and the 4th is the least. My next piece of code looks something like this:

x1=x(:,5);
x2=x(:,[5,2]);      
x3=x(:,[5,2,6]);    
x4=x(:,[5,2,6,8]);
x5=x(:,[5,2,6,8,3]);
x6=x(:,[5,2,6,8,3,7]);
x7=x(:,[5,2,6,8,3,7,4]);


errors2=zeros(7,1);

errors2(1)=svmtrain(y,x1,'-s 0 -t 2 -c 0.062500 -g 0.0039062 -v 10');
errors2(2)=svmtrain(y,x2,'-s 0 -t 2 -c 0.062500 -g 0.0039062 -v 10');
errors2(3)=svmtrain(y,x3,'-s 0 -t 2 -c 0.062500 -g 0.0039062 -v 10');
errors2(4)=svmtrain(y,x4,'-s 0 -t 2 -c 0.062500 -g 0.0039062 -v 10');
errors2(5)=svmtrain(y,x5,'-s 0 -t 2 -c 0.062500 -g 0.0039062 -v 10');
errors2(6)=svmtrain(y,x6,'-s 0 -t 2 -c 0.062500 -g 0.0039062 -v 10');
errors2(7)=svmtrain(y,x7,'-s 0 -t 2 -c 0.062500 -g 0.0039062 -v 10');

Note: gamma and C were computed using a grid search, and x is the complete matrix with 8 columns (corresponding to the 8 features).

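The usual grid search sweeps (C, gamma) over powers of two and keeps the pair with the best cross-validation accuracy. A sketch of that procedure using the same y and x with the LIBSVM MATLAB interface (an assumed procedure, not the asker's exact script):

bestAcc = 0;
for log2c = -5:2:15
    for log2g = -15:2:3
        opts = sprintf('-s 0 -t 2 -c %g -g %g -v 10', 2^log2c, 2^log2g);
        acc = svmtrain(y, x, opts);      % with -v, svmtrain returns the CV accuracy
        if acc > bestAcc
            bestAcc = acc; bestC = 2^log2c; bestG = 2^log2g;
        end
    end
end

The quoted values C = 0.0625 = 2^-4 and gamma = 0.0039062 = 2^-8 are consistent with such a power-of-two grid.
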
When I print the errors2 matrix, I get the following output:

errors2 =

   88.416
   92.229
   93.109
   94.135
   94.282
   94.575
   94.575

This means that I get the most accuracy when I use the largest feature subset and the least accuracy when I use only the most discriminative feature. As far as I know, I should get the most accuracy when I use a subset of features containing the most discriminative one. Why, then, is the program behaving this way? Can someone point out any errors that I might have made? (My intuition says that I've calculated C wrong, since it is so small.)

Upvotes: 0

Views: 219

Answers (1)

Richard

Reputation: 1131

The accuracies you are getting are what you would expect. Adding an extra feature should reduce the error rate (i.e. raise the cross-validation accuracy), because the classifier has more information to work with.

As an example, consider trying to work out which model a car is. The most discriminative feature is probably the manufacturer, but adding features such as engine size, height, width, length, weight, etc. will narrow it down further.

If you are considering a lot of features, some of which may have very low discriminative power, you might run into problems with overfitting your training data. Here you have just 8 features, but it already looks like adding the last, least discriminative feature has no effect. (In the car example, such low-power features might be how dirty the car is, the amount of tread left on the tyres, the channel the radio is tuned to, etc.)

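If you want to act on this systematically, one option (a sketch, not part of the original answer) is to keep the smallest ranked subset whose cross-validation accuracy is within some tolerance of the best value in your errors2 vector:

tol  = 0.5;                                     % accuracy points you are willing to give up
best = max(errors2);
k    = find(errors2 >= best - tol, 1, 'first'); % smallest subset that is close to the best
fprintf('Keep the first %d ranked features.\n', k);
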
Upvotes: 1
