Reputation: 434

Is the number of the predicted values correct from the test set for SVM?

I have a data set with 218 rows, and I'm working through [this example][1] ([git here][2]). I've split my data into train (nrow = 163; ~75%) and test (nrow = 55; ~25%).

When I get to the line `pred <- predict(model_svm, test)` and convert `pred` into a data frame, it has 163 rows instead of 55. Is this normal because the model was trained on 163 rows? Or should it have only 55 rows, since I'm predicting on the test set?

Some fake data:

featuredata_all <- matrix(rexp(218 * 23, rate = .1), ncol = 23) # 218 rows x 23 columns

Some of the code:


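```
library(data.table)
library(e1071)  # svm()
library(caret)  # confusionMatrix()
library(pROC)   # multiclass.roc(), plot.roc(), lines.roc()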

pt1 <- scale(featuredata_all[,1:22],center=T)
pt2 <- as.character(featuredata_all[,23]) #since the label is a string I kept it separate 

ft<-cbind.data.frame(pt1,pt2) #to preserve the label in text
colnames(ft)[23]<- "Cluster"

## 75% of the sample size
smp_size <- floor(0.75 * nrow(ft))

## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(ft)), size = smp_size)

train <- ft[train_ind,1:22] #163 reads
test  <- ft[-train_ind,1:22] #55 reads

trainlabel<- ft[train_ind,23] #163 labels
testlabel <- ft[-train_ind,23] #55 labels

#ftID <- cbind(ft, seq.int(nrow(ft))
#colnames(ftID)[24]<- "RowID"
#ftIDtestrows <- ftID[-train_ind,24]

#Support Vector Machine for classification
model_svm <- svm(trainlabel ~ as.matrix(train) )
summary(model_svm)

#Use the predictions on the data
# ---------------- This is where the question is ---------------- #
pred <- predict(model_svm, test)
# ----------------------------------------------------------------#

print(confusionMatrix(pred[1:nrow(test)],testlabel))

#ROC and AUC curves and their plots
#-----------------also------------->  was trying to get this to work as pred doesn't naturally end up with the expected 55 nrow from test set
roc.multi<-multiclass.roc(testlabel, as.numeric(pred[1:55])) 
rs <- roc.multi[['rocs']]
plot.roc(rs[[1]])
sapply(2:length(rs), function(i) lines.roc(rs[[i]], col = i))
```


 [1]: https://iamnagdev.com/2018/01/02/sound-analytics-in-r-for-animal-sound-classification-using-vector-machine/
 [2]: https://github.com/nagdevAmruthnath

Upvotes: 0

Answers (2)

SqueakyBeak

Reputation: 434

OK, I realized that I was training the model on my train set and then testing it on my test set. I needed to test it first by re-predicting the train set, and then apply it to the test set afterwards.

```
summary(model_svm)
# Predict on the training set first
pred <- predict(model_svm, train)

# Refit on the test set; the label vector needs as many entries as test has rows
model_svm <- svm(testlabel ~ as.matrix(test))
summary(model_svm)
# Predict on the test set
pred <- predict(model_svm, test)
```
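
For comparison, here is a minimal sketch of fitting once on the training rows and then predicting on both sets without refitting. It assumes the `e1071` package is installed. The data are random, and the names `features`, `train_df`, `test_df`, `pred_train` and `pred_test` are made up for illustration. With the formula interface and `data =`, `predict()` takes its columns from whichever data frame is passed in.

```r
library(e1071)

set.seed(123)

# Hypothetical stand-in data: 218 rows, 22 scaled features, a two-level label
features <- as.data.frame(scale(matrix(rnorm(218 * 22), ncol = 22)))
features$Cluster <- factor(sample(c("a", "b"), 218, replace = TRUE))

# 75% / 25% split, as in the question
train_ind <- sample(seq_len(nrow(features)), size = floor(0.75 * nrow(features)))
train_df <- features[train_ind, ]
test_df  <- features[-train_ind, ]

# Fit once, on the training rows only
model_svm <- svm(Cluster ~ ., data = train_df)

# Predict on each set with the same fitted model
pred_train <- predict(model_svm, train_df)
pred_test  <- predict(model_svm, test_df)

length(pred_train)  # 163
length(pred_test)   # 55

# Confusion table for the held-out rows
table(predicted = pred_test, actual = test_df$Cluster)
```

Since the labels here are random, the table itself is meaningless; the point is only that each prediction vector has one entry per row of the data passed to `predict()`.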

Upvotes: 0

Not_Dave

Reputation: 501

I was able to get 55 rows using the code below. The changes I made: `pt2` is built with `as.factor` instead of `as.character`, the model is fitted with `svm(x = ..., y = ...)` rather than a formula, and `pred <- predict(model_svm, test)` becomes `pred <- predict(model_svm, as.matrix(test))`.

# load libraries
library(data.table)
library(e1071)

# create dataset with random values
featuredata_all <- matrix(rnorm(23*218), ncol=23)

# scale features
pt1 <- scale(featuredata_all[,1:22],center=T)

# make column as factor
pt2 <- as.factor(ifelse(featuredata_all[,23]>0, 0,1)) # binary label derived from column 23, stored as a factor

# join data (optional)
ft<-cbind.data.frame(pt1,pt2) #to preserve the label in text
colnames(ft)[23]<- "Cluster"

## 75% of the sample size
smp_size <- floor(0.75 * nrow(ft))

## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(ft)), size = smp_size)

# split features into train and test sets
train <- ft[train_ind,1:22] #163 reads
test  <- ft[-train_ind,1:22] #55 reads
dim(train)
# [1] 163  22

dim(test)
# [1] 55  22

# split labels into train and test sets
trainlabel<- ft[train_ind,23] #163 labels
testlabel <- ft[-train_ind,23] #55 labels
length(trainlabel)
# [1] 163

length(testlabel)
# [1] 55

#Support Vector Machine for classification
model_svm <- svm(x= as.matrix(train), y = trainlabel, probability = T)
summary(model_svm)

# Call:
#   svm.default(x = as.matrix(train), y = trainlabel, probability = T)
# 
# 
# Parameters:
#   SVM-Type:  C-classification 
# SVM-Kernel:  radial 
# cost:  1 
# 
# Number of Support Vectors:  159
# 
# ( 78 81 )
# 
# 
# Number of Classes:  2 
# 
# Levels: 
#   0 1

#Use the predictions on the data
# ---------------- This is where the question is ---------------- #
pred <- predict(model_svm, as.matrix(test))
length(pred)
# [1] 55
# ----------------------------------------------------------------#

print(table(pred[1:nrow(test)],testlabel))
#    testlabel
#      0  1
#   0 14 14
#   1 11 16

Hope this helps.
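
One likely reason for the 163 rows in the question: when a formula contains an expression such as `as.matrix(train)`, R looks for `train` again at prediction time, does not find it in the new data, and falls back to the object in the workspace. The sketch below shows that lookup with base R's `lm` standing in for `svm` (an assumption made so it runs without extra packages); all object names are hypothetical.

```r
set.seed(123)

# Hypothetical stand-in data: 218 rows, 3 numeric features, numeric response
x_all <- matrix(rnorm(218 * 3), ncol = 3)
y_all <- rnorm(218)

train_ind <- sample(seq_len(218), size = floor(0.75 * 218))
train   <- as.data.frame(x_all[train_ind, ])   # 163 rows
test    <- as.data.frame(x_all[-train_ind, ])  # 55 rows
y_train <- y_all[train_ind]

# Formula written like the question's: the predictor is the expression as.matrix(train)
fit_expr <- lm(y_train ~ as.matrix(train))

# predict() re-evaluates as.matrix(train), so the rows of `test` are never used
# (R warns that newdata has 55 rows while the variables found have 163)
pred_expr <- suppressWarnings(predict(fit_expr, newdata = test))
length(pred_expr)  # 163

# Formula that refers to column names, with the data passed via `data =`
train_df <- cbind(train, y = y_train)
fit_cols <- lm(y ~ ., data = train_df)

pred_cols <- predict(fit_cols, newdata = test)
length(pred_cols)  # 55
```

Passing `x` and `y` directly, as in the code above, avoids this lookup altogether, which is probably why the prediction comes back with 55 rows.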

Upvotes: 1
