Reputation: 526
This is my first time using R, the e1071 package, and multiclass SVM, so I'm quite confused. The goal is: a sentence containing "sunny" should be classified as "yes", a sentence containing "cloud" as "maybe", and a sentence containing "rainy" as "no". The real goal is to do some text classification for my research.
I have two files:
The training file, train.csv, for example:
V1 V2
1 sunny yes
2 sunny sunny yes
3 sunny rainy sunny yes
4 sunny cloud sunny yes
5 rainy no
6 rainy rainy no
7 rainy sunny rainy no
8 rainy cloud rainy no
9 cloud maybe
10 cloud cloud maybe
11 cloud rainy cloud maybe
12 cloud sunny cloud maybe
And the test file, test.csv, for example:
V1
1 sunny
2 rainy
3 hello
4 cloud
5 a
6 b
7 cloud
8 d
9 e
10 f
11 g
12 hello
Following the examples for the iris dataset (https://cran.r-project.org/web/packages/e1071/e1071.pdf and http://rischanlab.github.io/SVM.html), I created my model and then tested it on the training data like this:
> library(e1071)
> train <- read.csv(file="C:/Users/Stef/Desktop/train.csv", sep = ";", header = FALSE)
> test <- read.csv(file="C:/Users/Stef/Desktop/test.csv", sep = ";", header = FALSE)
> attach(train)
> x <- subset(train, select=-V2)
> y <- V2
> model <- svm(V2 ~ ., data = train, probability=TRUE)
> summary(model)
Call:
svm(formula = V2 ~ ., data = train, probability = TRUE)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
gamma: 0.08333333
Number of Support Vectors: 12
( 4 4 4 )
Number of Classes: 3
Levels:
maybe no yes
> pred <- predict(model,x)
> system.time(pred <- predict(model,x))
user system elapsed
0 0 0
> table(pred,y)
y
pred maybe no yes
maybe 4 0 0
no 0 4 0
yes 0 0 4
> pred
1 2 3 4 5 6 7 8 9 10 11 12
yes yes yes yes no no no no maybe maybe maybe maybe
Levels: maybe no yes
I think this is fine so far. Now the question is: what about the test data? I couldn't find anything about predicting on test data, so I thought I should run the model on the test file. I did this:
> test
V1
1 sunny
2 rainy
3 hello
4 cloud
5 a
6 b
7 cloud
8 d
9 e
10 f
11 g
12 hello
> z <- subset(test, select=V1)
> pred <-predict(model,z)
Error in predict.svm(model, z) : test data does not match model !
What is wrong here? Can you please explain how I can test new data using the model trained earlier? Thank you.
EDIT
These are the first 5 rows of each .csv file:
> head(train,5)
V1 V2
1 sunny yes
2 sunny sunny yes
3 sunny rainy sunny yes
4 sunny cloud sunny yes
5 rainy no
> head(test,5)
V1
1 sunny
2 rainy
3 hello
4 cloud
5 a
Upvotes: 1
Views: 1580
Reputation: 11955
The factor levels of V1 differ between the train and test datasets, so you need to align them before predicting.
library(e1071)
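#sample data (stringsAsFactors = TRUE is needed on R >= 4.0, where strings are no longer converted to factors by default)
train_data <- data.frame(V1 = c("sunny","sunny sunny","rainy","rainy rainy","cloud","cloud cloud"),
                         V2 = c("yes","yes","no","no","maybe","maybe"),
                         stringsAsFactors = TRUE)
test_data <- data.frame(V1 = c("sunny","rainy","hello","cloud"), stringsAsFactors = TRUE)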
#fix levels in train_data & test_data dataset before running model
train_data$ind <- "train"
test_data$ind <- "test"
merged_data <- rbind(train_data[,-grep("V2", colnames(train_data))],test_data)
#train data
train <- merged_data[merged_data$ind=="train",]
train$V2 <- train_data$V2
train <- train[,-grep("ind", colnames(train))]
#test data
test <- merged_data[merged_data$ind=="test",]
test <- data.frame(V1 = test[,-grep("ind", colnames(test))])
#svm model
svm_model <- svm(V2 ~ ., data = train, probability=TRUE)
summary(svm_model)
train_pred <- predict(svm_model,train["V1"])
table(train_pred,train$V2)
#prediction on test data
test$test_pred <- predict(svm_model,test)
test
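As a smaller illustration of why the levels matter, here is a base R sketch (no e1071 needed). The data frames are hypothetical stand-ins for the question's files, and the claim that the error comes from a column-count mismatch is my reading of predict.svm, not something confirmed in this thread. It shows that different level sets produce design matrices of different widths, and one alternative to the rbind approach: re-encoding the test column with the training levels.

```r
# Hypothetical data mirroring the question's train and test files
train <- data.frame(V1 = factor(c("sunny", "rainy", "cloud")),
                    V2 = factor(c("yes", "no", "maybe")))
test <- data.frame(V1 = factor(c("sunny", "rainy", "hello", "cloud")))

# A factor predictor is expanded into one dummy column per level,
# so different level sets give matrices with different column counts
n_train_cols <- ncol(model.matrix(~ V1 - 1, data = train))
n_test_cols  <- ncol(model.matrix(~ V1 - 1, data = test))

# Re-encode the test column using the training levels;
# values never seen in training (here "hello") become NA
test$V1 <- factor(as.character(test$V1), levels = levels(train$V1))
unseen <- is.na(test$V1)
```

With the levels aligned like this, the test frame has the same encoding as the training data. Rows flagged in unseen carry no information the model was trained on, so they would need separate handling (for example, dropping them or assigning a default class).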
Hope this helps!
Upvotes: 1
Reputation: 941
I think the problem may be with your select argument to the subset function. What happens if you just execute pred<-predict(model,test)? It's a bit hard to tell whether your original data has two columns (V1,V2) or up to four. Since you trained/initialized the model with data=train, I think predicting on test instead of subset(test,) should resolve the issue.
predict will work on SVMs even if the number of rows in the test set differs from the number of rows the SVM was trained on, so it should be trivial. Something like:
test.preds<-predict(some.svm, test)
misclassification.rate<-mean(test.preds != test$V2)
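To make the misclassification rate concrete, here is a small base R sketch. It assumes a labelled test set, which the question's test file does not have; the truth and preds vectors are made up for illustration and stand in for test$V2 and the output of predict.

```r
# Hypothetical true labels and predictions, sharing the same factor levels
truth <- factor(c("yes", "no", "maybe", "yes"), levels = c("maybe", "no", "yes"))
preds <- factor(c("yes", "no", "yes", "yes"), levels = levels(truth))

# Confusion table: rows are predictions, columns are true labels
confusion <- table(pred = preds, actual = truth)

# Share of wrong predictions, and the complementary accuracy from the table diagonal
misclassification_rate <- mean(preds != truth)
accuracy <- sum(diag(confusion)) / sum(confusion)
```

Both factors are given the same levels explicitly so that the comparison and the confusion table line up even when a class is missing from the predictions.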
Upvotes: 0