SVM Predict Levels not matching between test and training data

Question

I'm trying to predict a binary classification problem dealing with recommending films.

I've got a training data set of 50 rows (movies) and 6 columns (5 movie attributes and a consensus on the film).

I then have a test data set of 20 films with the same columns.

I then run

pred<-predict(svm_model, test)

and receive

Error in predict.svm(svm_model, test) : test data does not match model !.

From similar posts, it seems that the error is because the levels don't match between the training and test datasets. This is true and I've proved it by comparing str(test) and str(train). However, both datasets come from randomly selected films and will always have different levels for their categorical attributes. Doing

levels(test$Attr1) <- levels(train$Attr1)

changes the actual column data in test, thus rendering the predictor incorrect. Does anyone know how to solve this issue?
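For illustration, here is a minimal sketch of why assigning levels directly alters the data. The genre values are made up and are not taken from the actual data set. Assigning to levels() renames the existing levels by position rather than remapping the values:

```r
# Hypothetical toy values; not the real Attr1 contents.
train_attr <- factor(c("Drama", "Comedy", "Action"))  # levels: Action, Comedy, Drama
test_attr  <- factor(c("Horror", "Drama", "Comedy"))  # levels: Comedy, Drama, Horror

# Overwrite the test levels with the training levels, as in the question.
relabelled <- test_attr
levels(relabelled) <- levels(train_attr)

# The underlying integer codes are untouched, so every value gets a new name:
# Comedy becomes Action, Drama becomes Comedy, Horror becomes Drama.
print(as.character(test_attr))
print(as.character(relabelled))
```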

The first half dozen rows of my training set are in the following link. https://justpaste.it/1ifsx

phiver · Accepted Answer

You could do something like this, assuming Attr1 is a character column:

1. Create a vector of levels from the unique values of Attr1 in both test and train.

2. Create a factor on Attr1 in both train and test using all the levels found in step 1.

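# Union of the values seen in either set; as.character() keeps this safe if the columns are already factors
levels <- unique(c(as.character(train$Attr1), as.character(test$Attr1)))
# Recode both columns against the shared level set
test$Attr1  <- factor(test$Attr1, levels = levels)
train$Attr1 <- factor(train$Attr1, levels = levels)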
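As a self-contained check of the idea, here is a minimal sketch on toy data. The data frames and the name all_levels are hypothetical stand-ins, not the asker's actual data:

```r
# Hypothetical toy data standing in for the real train/test sets.
train <- data.frame(Attr1 = c("Drama", "Comedy", "Action"), stringsAsFactors = FALSE)
test  <- data.frame(Attr1 = c("Horror", "Drama"), stringsAsFactors = FALSE)

# Union of the values seen in either set, in order of first appearance.
all_levels <- unique(c(as.character(train$Attr1), as.character(test$Attr1)))

# Recode both columns against the same level set.
train$Attr1 <- factor(train$Attr1, levels = all_levels)
test$Attr1  <- factor(test$Attr1,  levels = all_levels)

# Both columns now share one level set, and the test values are unchanged.
print(identical(levels(train$Attr1), levels(test$Attr1)))
print(as.character(test$Attr1))
```

Note that the model would have to be fitted on the recoded train set for the levels to match at prediction time, and that a level occurring only in test (here "Horror") has no training rows behind it, so the model has learned nothing about it.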

If you do not want factors, wrap the factor call in as.integer and you will get numbers instead of factors. That is sometimes handier in models like xgboost and saves on one-hot encoding.

as.integer(factor(test$Attr1, levels=levels))
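The shared level set matters for the integer version too. A minimal sketch with made-up values (the vectors and variable names are hypothetical) shows that codes computed against the combined levels agree across both sets, while codes computed separately do not:

```r
# Hypothetical toy values; not the real Attr1 contents.
train_attr <- c("Drama", "Comedy", "Action")
test_attr  <- c("Horror", "Drama")
all_levels <- unique(c(train_attr, test_attr))  # Drama, Comedy, Action, Horror

# Codes against the shared levels: "Drama" is 1 in both sets.
train_codes <- as.integer(factor(train_attr, levels = all_levels))
test_codes  <- as.integer(factor(test_attr,  levels = all_levels))

# Codes computed on test alone: levels default to the sorted unique values,
# so "Horror" becomes 2, which collides with "Comedy" in the training codes.
test_codes_separate <- as.integer(factor(test_attr))

print(train_codes)
print(test_codes)
print(test_codes_separate)
```

Keep in mind that the integers follow the order of the level vector, which is arbitrary here, so a model that treats them as numeric will see an ordering that carries no meaning.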
