Reputation: 4682
I am building a model for classifying malignant breast tumors using LIBSVM. Here is the algorithm I am following:
The problem is that I am getting 100% accuracy on an 80:20 train:test split using LIBSVM. I have not excluded any features, and I am NOT training and testing on the same data. Any hints as to where I could be wrong? Here is some other relevant information:
cost = [2^-10, 2^-8, 2^-6, 2^-4, 2^-2, 0.5, 1,
2, 2^2, 2^3, 2^4, 2^5, 2^6, 2^7, 2^8, 2^9, 2^10];
g = [2^-10, 2^-8, 2^-6, 2^-4, 2^-2, 2^-1, 1,
2, 2^2, 2^3, 2^4, 2^5, 2^6, 2^7, 2^8, 2^9, 2^10];
optimal C = 1;
optimal gamma = 9.7656e-04;
Accuracy on 50:50 train:test split: 98.5337%
Accuracy on 70:30 train:test split: 99.5122%
Dataset used: University of Wisconsin breast cancer dataset (682 entries).
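For reference, the setup above (80:20 split, exponential grid over C and gamma) can be sketched with scikit-learn's SVC, which wraps LIBSVM. Two assumptions in this sketch: sklearn's load_breast_cancer is the related Wisconsin *Diagnostic* dataset (569 rows with unscaled features, not the 682-entry original, so the features are standardized first), and the grid is coarsened for brevity.

```python
# Sketch of the grid search described in the question, via scikit-learn's
# SVC (a LIBSVM wrapper). NOTE: load_breast_cancer is the Wisconsin
# *Diagnostic* dataset (569 rows), not the 682-entry original, and its
# features are unscaled, so a StandardScaler step is added.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# 80:20 train:test split; the test portion is never used for fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC(kernel="rbf"))])

# Exponentially spaced grids as in the question, coarsened to
# exponents {-10, -6, -2, 2, 6, 10} to keep the sketch fast.
param_grid = {
    "svm__C": [2.0**e for e in range(-10, 11, 4)],
    "svm__gamma": [2.0**e for e in range(-10, 11, 4)],
}

# 5-fold cross-validation on the training data only.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print("test accuracy:", search.score(X_test, y_test))
```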
Upvotes: 0
Views: 534
Reputation: 77910
Summary: You didn't complain about the other two data sets; the 100% accuracy is reasonably consistent with those. What makes you think you should have a lower accuracy?
Let's look at the counts of misclassification:
50:50 data set -- 5 / 341 errors
70:30 data set -- 1 / 205 errors
80:20 data set -- 0 / 136 errors
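These counts follow directly from the reported accuracies, assuming each test set is the stated fraction of the 682 entries:

```python
# Reproduce the misclassification counts from the reported accuracies,
# assuming each test set is the stated fraction of the 682 entries.
n = 682
for test_frac, acc in [(0.50, 0.985337), (0.30, 0.995122), (0.20, 1.0)]:
    n_test = round(n * test_frac)
    n_errors = round(n_test * (1 - acc))
    print(f"{n_test} test samples -> {n_errors} errors")
# -> 341 -> 5, 205 -> 1, 136 -> 0
```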
The 80:20 results are sufficiently consistent with your prior results: your accuracy has increased to (apparently) something over 99.8%.
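To put a number on "consistent": taking the 70:30 result as the true per-sample accuracy (an assumption for this quick binomial check), a spotless 136-sample test is not even surprising:

```python
# Probability of zero errors among 136 independent test samples, if the
# true per-sample accuracy is the 99.5122% observed on the 70:30 split.
p_correct = 0.995122
n_test = 136
p_zero_errors = p_correct ** n_test
print(p_zero_errors)  # about 0.51, i.e. roughly a coin flip
```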
Demanding maximum accuracy from your training suggests that the model may well retain all features, with a distinct danger of over-fitting. However, since you apparently find the first two results acceptable, I infer that the data set is highly self-consistent. That consistency is odd in my experience, but you don't describe the data set's properties, or give us samples or a useful link to check.
Upvotes: 3