Reputation: 4642
I am trying to build a classifier to predict breast cancer using the UCI dataset. I am using support vector machines. Despite my most sincere efforts to improve upon the accuracy of the classifier, I cannot get beyond 97.062%. I've tried the following:
1. Finding the best C and gamma using grid search.
2. Finding the most discriminative feature using F-score.
Can someone suggest techniques to improve the accuracy? I am aiming for at least 99%.
1. The data are already normalized to the range [0,10]. Will normalizing to [0,1] help?
2. Is there some other method to find the best C and gamma?
Upvotes: 5
Views: 24056
Reputation: 1
Some thoughts came to mind when reading your question and the arguments you put forward about the author who claims to have achieved acc = 99.51%. My first thought was OVERFITTING. I could be wrong, because it may depend on the dataset, but overfitting is the first thing I would suspect. Now my questions:
1- Has the author stated in his article whether the dataset was split into a training set and a testing set?
2- Was acc = 99.51% achieved on the training set or on the testing set?
On the training set you can reach acc = 99.51% when your model is overfitting. In that case, the SVM classifier generally performs poorly on unseen data.
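To make the train-versus-test point concrete, here is a minimal sketch. It assumes scikit-learn and uses its bundled `load_breast_cancer` data as a stand-in, which is not necessarily the same UCI file the question uses. The `gamma=1.0` setting on unscaled features is deliberately poor, chosen only to provoke overfitting.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in dataset (assumption): scikit-learn's bundled breast cancer data.
X, y = load_breast_cancer(return_X_y=True)

# Hold out 30% of the samples so accuracy can be measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# A large gamma on unscaled features lets the RBF SVM memorize the training set.
clf = SVC(kernel="rbf", C=1.0, gamma=1.0)
clf.fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print(f"training accuracy: {train_acc:.4f}")
print(f"testing accuracy:  {test_acc:.4f}")
```

If the two numbers are far apart, the training figure says little about how the classifier behaves on new patients, which is why it matters which set a reported accuracy comes from.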
Upvotes: 0
Reputation: 98
For SVM, it's important to have the same scaling for all features. Normally this is done by scaling the values in each feature (column) so that the mean is 0 and the variance is 1. Another way is to scale each feature so that the min and max are, for example, 0 and 1. However, there isn't any difference between [0, 1] and [0, 10]. Both will show the same performance.
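One way to see why [0, 1] and [0, 10] are equivalent, assuming an RBF kernel (the question mentions gamma, so this seems likely): dividing every feature by 10 divides squared distances by 100, and multiplying gamma by 100 undoes that exactly. The sketch below uses plain Python and made-up sample points; `rbf_kernel` is a hypothetical helper written here for illustration, not a library function.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * ||x - y||^2) for two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Two made-up samples with features in the range [0, 10].
x = [1.0, 5.0, 10.0]
y = [3.0, 4.0, 8.0]
scale = 10.0
gamma = 0.01

# The same samples rescaled to the range [0, 1].
x_small = [v / scale for v in x]
y_small = [v / scale for v in y]

k_original = rbf_kernel(x, y, gamma)
# Compensate for the rescaling by multiplying gamma by scale squared.
k_rescaled = rbf_kernel(x_small, y_small, gamma * scale ** 2)

print(math.isclose(k_original, k_rescaled))
```

So the two ranges give the same classifier only once gamma is re-tuned for the new range. A grid search that covers a wide enough span of gamma values should find the matching setting in either case.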
If you insist on using SVM for classification, another way that may result in improvement is ensembling multiple SVMs. In case you are using Python, you can try BaggingClassifier from sklearn.ensemble.
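A minimal sketch of bagging several SVMs, assuming scikit-learn and again using its bundled `load_breast_cancer` data as a stand-in for the question's file. The number of estimators, the default `SVC` settings, and the 5-fold split are illustrative choices, not tuned values.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in dataset (assumption): scikit-learn's bundled breast cancer data.
X, y = load_breast_cancer(return_X_y=True)

# Each ensemble member standardizes its own bootstrap sample, then fits an RBF SVM.
base_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# The base estimator is passed positionally because the keyword name
# differs between scikit-learn versions.
bagged_svm = BaggingClassifier(base_svm, n_estimators=10, random_state=0)

# 5-fold cross-validation gives a less optimistic estimate than training accuracy.
scores = cross_val_score(bagged_svm, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.4f}")
```

Whether bagging actually beats a single well-tuned SVM depends on the data, so it is worth comparing both with the same cross-validation split.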
Also notice that you can't expect to reach an arbitrarily high level of performance on a real dataset. I think 97% is a very good performance. It is possible that you will overfit the data if you go higher than this.
Upvotes: 4