Reputation: 1563
I'm trying to predict a binary classification problem dealing with recommending films.
I've got a training data set of 50 rows (movies) and 6 columns (5 movie attributes and a consensus on the film).
I then have a test data set of 20 films with the same columns.
I then run
pred<-predict(svm_model, test)
and receive
Error in predict.svm(svm_model, test) : test data does not match model !.
From similar posts, it seems that the error is because the levels don't match between the training and test datasets. This is true and I've proved it by comparing str(test)
and str(train)
. However, both datasets come from randomly selected films and will always have different levels for their categorical attributes. Doing
levels(test$Attr1) <- levels(train$Attr1)
changes the actual column data in test, thus rendering the predictor incorrect. Does anyone know how to solve this issue?
The first half dozen rows of my training set are in the following link. https://justpaste.it/1ifsx
Upvotes: 0
Views: 1030
Reputation: 23608
You could do something like this, assuming Attr1 is a character:
Create a factor on train and test attribute1 with all the levels found in point 1.
levels <- unique(c(train$Attr1, test$Attr1))
test$Attr1 <- factor(test$Attr1, levels=levels)
train$Attr1 <- factor(train$Attr1, levels=levels)
If you do not want factos, add as.integer
to part of the code and you will get numbers instaed of factors. That is sometimes handier in models like xgboost and saves on one hot encoding.
as.integer(factor(test$Attr1, levels=levels))
Upvotes: 0