Rachel Zhang

Reputation: 564

R: e1071 svm function - is it necessary to convert categorical to dummies?

I know that an SVM model needs preprocessing that converts categorical variables into dummy variables. However, when I use e1071's svm function to fit a model on the unconverted data (see train and test), no error is raised. I assume the function converts them automatically.

However, when I use the converted data (see train2 and test2) to fit an SVM model, the function gives me a different result (p1 and p2 are not the same).

Could anyone tell me what happens to the unconverted data? Does the function just ignore the categorical variables, or does something else happen?

library(e1071)
library(dummies)

set.seed(0)
x = data.frame(matrix(rnorm(200, 10, 10), ncol = 5))   #fake numerical predictors
cate = factor(sample(LETTERS[1:5], 40, replace=TRUE))  #fake categorical variables
y = rnorm(40, 50, 10)             #fake response

data = cbind(y,cate,x)
ind = sample(40, 30, replace=FALSE)
train = data[ind, ]
test = data[-ind, ]

#without dummy (data was already built above, so it is not rebuilt here)
svm.model = svm(y~., train)
p1 = predict(svm.model, test)

#with dummy
train2 = cbind(train[,-2], dummy(train[,2]))
colnames(train2) = c('y', paste0('X',1:5), LETTERS[1:4])
test2 = cbind(test[,-2], dummy(test[,2]))
colnames(test2) = c('y', paste0('X',1:5), LETTERS[1:4])
svm.model2 = svm(y~., train2)
p2 = predict(svm.model2, test2)

Upvotes: 1

Views: 2234

Answers (2)

Oliver

Reputation: 8592

What you're observing is indeed as you stated: factors are converted to dummies automatically. In fact, we can reproduce both svm.model and svm.model2 quite easily.

mf <- model.frame(y ~ . - 1, train) # - 1 because the intercept is unused in svm.
mt <- terms(mf)
X <- model.matrix(mt, mf)
Xtest <- model.matrix(mt, test)
Y <- model.response(mf)
svm.model3 <- svm(X, Y)

Note that I did not use svm(formula, data) but svm(x, y). Now which model did we actually recreate? Let's compare with p1 and p2:

all.equal(p1, predict(svm.model3, newdata = Xtest))
# [1] "Mean relative difference: 0.03064692"
all.equal(p2, predict(svm.model3, newdata = Xtest))
# [1] TRUE

It seems we've recreated model 2, the one with manual dummies. The reason this reproduces svm.model2 and not svm.model is the scale parameter. From help(svm), on the scale argument:

A logical vector indicating the variables to be scaled. If scale is of length 1, the value is recycled as many times as needed. Per default, data are scaled internally (both x and y variables) to zero mean and unit variance. The center and scale values are returned and used for later predictions.

From this, the difference (and really the issue) most likely comes from scaling. When dummy columns are supplied manually, svm treats them as ordinary numeric columns and scales them; when it converts factors itself through the formula interface, it appears to leave the generated dummy columns unscaled. We can test this theory by setting the scale parameter manually:

# labels(mt) = 'cate', 'X1', 'X2', ...
# names(attr(X, 'contrasts')) = 'cate'
# i.e. scale everything except the columns generated from 'cate'
not_dummies <- !(labels(mt) %in% names(attr(X, 'contrasts')))
n <- table(attr(X, 'assign'))
scale <- rep(not_dummies, n)
svm.model4 <- svm(X, Y, scale = scale)
all.equal(p1, predict(svm.model4, newdata = Xtest))
# [1] TRUE
all.equal(p2, predict(svm.model4, newdata = Xtest))
# [1] "Mean relative difference: 0.03124989"
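To make the construction of the scale vector easier to follow, here is a minimal base R sketch on a toy data frame. The data and the names toy, g, x1, x2 and scale_cols are made up for illustration; only model.frame, terms and model.matrix are used, so e1071 is not needed to run it.

```r
# Toy data: one factor (g) and two numeric predictors (hypothetical names).
toy <- data.frame(
  y  = c(1.2, 0.7, 3.1, 2.4, 1.9, 2.8),
  g  = factor(c("a", "b", "c", "a", "b", "c")),
  x1 = c(10, 12, 9, 14, 11, 13),
  x2 = c(0.5, 0.1, 0.9, 0.3, 0.7, 0.2)
)

mf <- model.frame(y ~ . - 1, toy)  # same intercept-free formula as above
mt <- terms(mf)
X  <- model.matrix(mt, mf)

colnames(X)                  # "ga" "gb" "gc" "x1" "x2"
attr(X, "assign")            # 1 1 1 2 3: which term each column came from
names(attr(X, "contrasts"))  # "g": the terms that were factors

# One TRUE/FALSE per term, then repeated once per generated column.
not_dummies <- !(labels(mt) %in% names(attr(X, "contrasts")))
n <- as.vector(table(attr(X, "assign")))
scale_cols <- rep(not_dummies, n)
names(scale_cols) <- colnames(X)
scale_cols                   # FALSE for ga, gb, gc; TRUE for x1, x2
```

The resulting logical vector has one entry per column of X, which is the shape the scale argument of svm(x, y) expects according to the help text quoted above.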

So what we see is that:

1) svm, as stated, converts factors into dummy variables automatically.

2) It does not, however, check for dummies that are provided manually, which can cause unexpected behaviour if you create them yourself.

Upvotes: 2

Chuck P

Reputation: 3923

From the documentation it is clear that factors are treated at least slightly differently, hence the comment "If the predictor variables include factors, the formula interface must be used to get a correct model matrix."

My hunch is that the differences have to do with scaling (the default in svm). Note the difference between:

> svm.model$x.scale$`scaled:center`
       X1        X2        X3        X4        X5 
10.091157  8.739654 10.395121  7.856475 11.660454 
> svm.model2$x.scale$`scaled:center`
        X1         X2         X3         X4         X5          A          B          C          D      X.NA. 
10.0911569  8.7396541 10.3951208  7.8564754 11.6604540  0.2000000  0.1333333  0.1333333  0.2333333  0.3000000 
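In the second output, the centers reported for the dummy columns are simply the proportion of rows in each level (they sum to 1), which suggests those 0/1 columns were centered and scaled like any numeric predictor. As a rough illustration of what that does to a dummy column, here is a base R sketch; dummy_col is a made-up vector, not data from the question.

```r
# Hypothetical 0/1 dummy column: 3 of 10 rows belong to the level.
dummy_col <- c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0)

center <- mean(dummy_col)  # proportion of ones, here 0.3
spread <- sd(dummy_col)    # sample standard deviation

# Zero mean, unit variance: the 0/1 coding becomes two non-integer values.
scaled <- (dummy_col - center) / spread
unique(round(scaled, 3))

# Base R's scale() applies the same transformation.
all.equal(as.vector(scale(dummy_col)), scaled)
```

After this transformation the column no longer holds 0 and 1, so a model fitted on scaled dummies sees different inputs from one fitted on unscaled dummies, which is consistent with p1 and p2 differing.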

Upvotes: 1
