Reputation: 839
I am trying to predict class probabilities with a simple KNN model using the caret package in R. It always gives the same error, even in this very simple reproducible example:
library(caret)
set.seed(1)
#generate training dataset "a"
n = 10000
a = matrix(rnorm(n*8,sd=1000000),nrow = n)
y = round(runif(n))
a = cbind(y,a)
a = as.data.frame(a)
a[,1] = as.factor(a[,1])
colnames(a) = c("y",paste0("V",1:8))
#estimate simple KNN model
ctrl <- trainControl(method="none",repeats = 1)
knnFit <- train(y ~ ., data = a, method = "knn", trControl = ctrl, preProcess = c("center","scale"), tuneGrid = data.frame(k = 10))
#predict on the training dataset (=useless, but should work)
knnPredict <- predict(knnFit,newdata = a, type="prob")
This gives
Error in `[.data.frame`(out, , obsLevels, drop = FALSE) :
  undefined columns selected
Defining a more realistic test dataset "b" without the target variable y...
#generate test dataset
b = matrix(rnorm(n*8,sd=1000000),nrow = n)
b = as.data.frame(b)
colnames(b) = c(paste0("V",1:8))
#predict on the test dataset
knnPredict <- predict(knnFit,newdata = b, type="prob")
gives the same error
Error in `[.data.frame`(out, , obsLevels, drop = FALSE) :
  undefined columns selected
I know that the column names are important, but here they are identical. What is wrong here? Thanks!
Upvotes: 0
Views: 3298
Reputation: 23598
The problem is your y variable. When you ask for class probabilities, the train and/or the predict function puts them into a data frame with one column per class. If the factor levels are not valid R variable names, they are changed automatically (e.g. "0" becomes "X0"), so the columns can no longer be selected by the original level names, which is what "undefined columns selected" is reporting. See also this post.
If you replace the as.factor line in your code with the following, it should work:
a[,1] = factor(a[,1], labels = c("no", "yes"))
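The renaming described above can be reproduced in base R, without caret. The sketch below is only an illustration of the same mechanism, not caret's actual internals: the `probs` data frame and the helper `levels_ok` are hypothetical names made up for this example, and the probability values are arbitrary.

```r
# Base R only; the level names "0" and "1" mirror the question's y variable.
lev <- c("0", "1")

# make.names() turns syntactically invalid names into valid ones.
fixed <- make.names(lev)
print(fixed)

# data.frame() applies the same kind of renaming to column names by default
# (check.names = TRUE), so the columns end up as "X0" and "X1".
probs <- data.frame(`0` = c(0.4, 0.7), `1` = c(0.6, 0.3))
print(colnames(probs))

# Selecting columns by the original level names then fails,
# with the same message as in the question.
res <- tryCatch(
  probs[, lev, drop = FALSE],
  error = function(e) conditionMessage(e)
)
print(res)

# Hypothetical helper: TRUE if all factor levels are already valid names.
levels_ok <- function(f) all(levels(f) == make.names(levels(f)))
print(levels_ok(factor(c(0, 1))))
print(levels_ok(factor(c(0, 1), labels = c("no", "yes"))))
```

Running a check like `levels_ok(a$y)` before calling `train` with `type = "prob"` predictions in mind is one way to catch this early. Relabelling the levels, as in the line above, is what makes the check pass.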
Upvotes: 1