Prashant Pandey

Reputation: 4682

100% accuracy in classification with LIBSVM: what could be wrong?

I am building a model for classifying malignant breast tumors using LIBSVM. Here is the algorithm I am following:

  1. Use backward elimination for feature selection.
  2. Calculate C and gamma for each set of features using grid search.
  3. Derive the optimal C and gamma using 10-fold cross-validation.
  4. Using the above steps, find the best possible subset of features and the maximum accuracy.

The problem is that I am getting 100% accuracy on an 80:20 dataset using LIBSVM. I have not excluded any features, and I am NOT training and testing on the same data. Any hints as to where I could be going wrong? Here is some other relevant information:

cost = [2^-10, 2^-8, 2^-6, 2^-4, 2^-2, 0.5, 1,
        2, 2^2, 2^3, 2^4, 2^5, 2^6, 2^7, 2^8, 2^9, 2^10];
g = [2^-10, 2^-8, 2^-6, 2^-4, 2^-2, 2^-1, 1,
     2, 2^2, 2^3, 2^4, 2^5, 2^6, 2^7, 2^8, 2^9, 2^10];
optimal C = 1;
optimal gamma = 9.7656e-04;
Accuracy on 50:50 test:train dataset: 98.5337%
Accuracy on 70:30 test:train dataset: 99.5122%
Dataset used: University of Wisconsin breast cancer dataset (682 entries).
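
One thing worth checking in the numbers above is where the selected gamma sits in the search grid. The following is a minimal Python sketch (not the original MATLAB/LIBSVM code) that rebuilds the grid from the exponents listed above and finds the grid value closest to the reported gamma. The variable names are illustrative assumptions.

```python
# Rebuild the search grid from the exponents listed in the question.
# Both the cost grid and the gamma grid use the same 17 values.
exponents = [-10, -8, -6, -4, -2, -1] + list(range(0, 11))
grid = [2.0 ** e for e in exponents]

# Gamma value reported as the result of the search.
reported_gamma = 9.7656e-04

# Find the grid entry closest to the reported (rounded) value.
closest = min(grid, key=lambda g: abs(g - reported_gamma))
position = grid.index(closest)

print(f"closest grid value: {closest} (index {position} of {len(grid) - 1})")
```

The closest value is 2^-10, the smallest entry in the grid. When a grid search selects a boundary value, extending the grid in that direction is a common sanity check, since the true optimum may lie outside the searched range.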

Upvotes: 0

Views: 534

Answers (1)

Prune

Reputation: 77910

Summary: You didn't complain about the other two data sets; the 100% accuracy is reasonably consistent with those. What makes you think you should have a lower accuracy?

Let's look at the counts of misclassification:

50:50 data set -- 5 / 341 errors
70:30 data set -- 1 / 205 errors
80:20 data set -- 0 / 136 errors
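
These counts can be reproduced from the figures in the question. Here is a small Python sketch, assuming the test-set size is the total of 682 entries times the test fraction, rounded to the nearest integer (the question does not say exactly how the split was made):

```python
TOTAL_ENTRIES = 682  # dataset size stated in the question

# (test fraction, reported accuracy in percent) for each split
splits = {
    "50:50": (0.5, 98.5337),
    "70:30": (0.3, 99.5122),
    "80:20": (0.2, 100.0),
}

def error_count(test_fraction, accuracy_pct, total=TOTAL_ENTRIES):
    """Return (misclassified, test set size) implied by a reported accuracy."""
    n_test = round(total * test_fraction)
    errors = round(n_test * (1 - accuracy_pct / 100))
    return errors, n_test

results = {name: error_count(frac, acc) for name, (frac, acc) in splits.items()}
for name, (errors, n_test) in results.items():
    print(f"{name} data set -- {errors} / {n_test} errors")
```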

The 80:20 results are sufficiently consistent with your prior results: your accuracy has increased to (apparently) something over 99.8%.
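
To gauge how much a perfect score on 136 samples can tell you, one option is an upper confidence bound on the true error rate. This sketch assumes the test samples are independent draws and uses the exact one-sided binomial bound for zero observed errors; the 95% level is my choice, not something from the question.

```python
def zero_error_upper_bound(n_test, confidence=0.95):
    """Upper bound on the true error rate when 0 of n_test samples are misclassified.

    Solves (1 - p) ** n_test = 1 - confidence for p.
    """
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_test)

bound = zero_error_upper_bound(136)
rule_of_three = 3 / 136  # common quick approximation of the same bound

print(f"95% upper bound on error rate: {bound:.4f}")
print(f"rule-of-three approximation:   {rule_of_three:.4f}")
```

Under these assumptions the bound comes out at a little over 2%, so a test set of this size cannot distinguish a perfect classifier from one with the error rates seen in the other two splits.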

Demanding maximum accuracy during training means the selection step may well retain all features, with a distinct danger of over-fitting. However, since you apparently find the results on the first two data sets acceptable, I intuit that the data set is highly self-consistent. In my experience that degree of consistency is odd, but you don't describe the data set's properties, give us samples, or provide a useful link to check.

Upvotes: 3
