jld

Reputation: 476

Basic SVM issues with e1071: test error rate doesn't match up with tune's results

This seems like a very basic question but I can't seem to find the answer anywhere. I'm new to SVMs and ML in general and am trying to do a few simple exercises but the results don't seem to match up. I'm using e1071 with R and have been going through An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani.

My question: why is it that when I use predict I don't seem to have any classification errors and yet the results of the tune function indicate a non-zero error rate? My code (I'm looking at three classes):

set.seed(4)
dat <- data.frame(pop = rnorm(900, c(0,3,6), 1), strat = factor(rep(c(0,1,2), times=300)))
ind <- sample(1:900)
train <- dat[ind[1:600],]
test <- dat[ind[601:900],]

tune1 <- tune(svm, train.x=train[,1], train.y=train[,2], kernel="radial", ranges=list(cost=10^(-1:2), gamma=c(.5,1,2)))
svm.tuned <- svm(train[,2]~., data=train, kernel = "radial",  cost=10, gamma=1) # I just entered the optimal cost and gamma values returned by tune
test.pred <- predict(svm.tuned, newdata=data.frame(pop=test[,1],strat=test[,2]))

So when I look at test.pred I see that every value matches up with the true class labels. Yet when I tuned the model it reported an error rate of around 0.06, and either way a test error rate of 0 seems absurd for non-separable data (unless I'm wrong about this not being separable?). Any clarification would be tremendously helpful. Thanks a lot. (The quick check I'm using to compare the predictions with the true labels is below.)
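In case it's relevant, this is roughly how I'm comparing the predictions with the labels (just base R, using the objects defined above):

# Confusion table: predicted class vs. true class on the test rows
table(predicted = test.pred, actual = test$strat)

# Overall test error rate (0 in my run, which is what puzzles me)
mean(test.pred != test$strat)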

Upvotes: 4

Views: 4739

Answers (1)

lejlot

Reputation: 66775

The tune function performs 10-fold cross validation. It randomly splits your training data into 10 parts and then, iteratively:

  • selects one of the 10 parts and calls it the "validation set"
  • selects the remaining 9 and calls them the "training set"
  • trains the SVM with the given parameters on this training set and checks how well it does on the validation set
  • finally, computes the mean error across these 10 "folds"

The error reported by the tune function is this mean cross-validation error. Once the best parameters are chosen, you train your model on the whole training set, which is exactly 1/9 bigger than the sets used during tuning. As a result, in your particular case (it does not happen often) you get a classifier that perfectly predicts your "test" set, while some of the smaller models trained during tuning made a small mistake or two, which is why the reported errors differ.
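As a concrete illustration (a minimal sketch, assuming the tune1 object from the question and, to the best of my knowledge, the standard fields of e1071's tune result), you can read the best parameters and the cross-validation error straight off the tuning object instead of copying them by hand:

summary(tune1)            # mean CV error for every cost/gamma combination tried
tune1$best.parameters     # the cost/gamma pair with the lowest mean CV error
tune1$best.performance    # that lowest mean CV error (the ~0.06 reported above)
tune1$best.model          # an svm already refit on the full training set with those parameters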

UPDATE

It seems that you are actually training your model with the class labels included among the input features. Look at your

svm.tuned$SV

variable, which holds the support vectors.
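As a quick check (a sketch, assuming the formula interface expands the factor into dummy columns), the feature names of the support vector matrix should reveal that strat was used as a predictor and not only as the response:

colnames(svm.tuned$SV)   # should list dummy columns for strat in addition to pop
dim(svm.tuned$SV)        # number of support vectors x number of input features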

To train svm, simply run

svm(x,y,kernel="...",...)

for example

svm(train$pop, train$strat, kernel="linear" )

which results in some misclassifications (as expected, since a linear kernel cannot perfectly separate such data).

Or using your notation

svm.tuned <- svm(strat~., data=train, kernel = "radial",  cost=10, gamma=1)

Note that you should use the name of the data frame column, strat, not its index.
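For completeness, a minimal end-to-end sketch with the corrected formula, reusing the train/test split from the question (the exact error rate will depend on the random seed):

svm.tuned <- svm(strat ~ ., data = train, kernel = "radial", cost = 10, gamma = 1)

# Predict on the held-out rows; with the label no longer leaking in as a feature,
# the test error should be small but typically non-zero
test.pred <- predict(svm.tuned, newdata = test)
mean(test.pred != test$strat)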

Upvotes: 6
