HOSS_JFL

Reputation: 839

Caret and KNN in R: predict function gives error

I am trying to predict with a simple KNN model using the caret package in R. predict() always fails with the same error, even in this very simple reproducible example:

library(caret)
set.seed(1)

#generate training dataset "a" 
n = 10000
a = matrix(rnorm(n*8,sd=1000000),nrow = n)
y = round(runif(n))
a = cbind(y,a)
a = as.data.frame(a)
a[,1] = as.factor(a[,1])
colnames(a) = c("y",paste0("V",1:8))

#estimate simple KNN model
ctrl <- trainControl(method="none",repeats = 1)
knnFit <- train(y ~ ., data = a, method = "knn", trControl = ctrl, preProcess = c("center","scale"),  tuneGrid = data.frame(k = 10))

#predict on the training dataset (=useless, but should work)
knnPredict <- predict(knnFit,newdata = a,  type="prob")

This gives

Error in [.data.frame(out, , obsLevels, drop = FALSE) : undefined columns selected

Defining a more realistic test dataset "b" without the target variable y...

#generate test dataset
b =  matrix(rnorm(n*8,sd=1000000),nrow = n) 
b = as.data.frame(b)
colnames(b) = c(paste0("V",1:8))

#predict on the test dataset
knnPredict <- predict(knnFit,newdata = b,  type="prob")

gives the same error

Error in [.data.frame(out, , obsLevels, drop = FALSE) : undefined columns selected

I know that the column names are important, but here they are identical. What is wrong here? Thanks!

Upvotes: 0

Views: 3298

Answers (1)

phiver

Reputation: 23598

The problem is your y variable. When you ask for class probabilities, the train and/or the predict function puts them into a data frame with one column per class. If the factor levels are not valid R variable names, they are automatically changed (e.g. "0" becomes "X0"), so the columns can no longer be selected by the original level names, which is what "undefined columns selected" is complaining about. See also this post.
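To see the renaming in isolation, here is a minimal base R sketch. It does not call caret and is not caret's actual internal code; the objects lv and probs are made up for illustration, and the assumption is only that the probabilities end up in a data frame built with the default name checking.

```r
# Hypothetical stand-ins: the class levels from the question and a fake
# probability matrix with one column per level.
lv <- c("0", "1")
m  <- matrix(0.5, nrow = 3, ncol = 2, dimnames = list(NULL, lv))

# "0" and "1" are not syntactically valid names, so R rewrites them.
make.names(lv)                 # "X0" "X1"

# data.frame() applies the same rewriting by default (check.names = TRUE).
probs <- data.frame(m)
names(probs)                   # "X0" "X1"

# Selecting the columns by the original level names now fails.
msg <- tryCatch(
  probs[, lv, drop = FALSE],
  error = function(e) conditionMessage(e)
)
msg                            # "undefined columns selected"
```

The message is the same one reported in the question, which is consistent with the explanation above.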

If you replace the line a[,1] = as.factor(a[,1]) in your code with the following, it should work:

a[,1] = factor(a[,1], labels = c("no", "yes"))
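As a quick sanity check before training, you can verify that the new levels survive make.names() unchanged. This is a small base R sketch; the vector y_raw is a made-up stand-in for the 0/1 column in the question.

```r
# Hypothetical 0/1 target, standing in for a[,1] before conversion.
y_raw <- c(0, 1, 1, 0, 1)

# Levels are sorted, so 0 is labelled "no" and 1 is labelled "yes".
y <- factor(y_raw, labels = c("no", "yes"))
levels(y)                                        # "no" "yes"

# TRUE means the levels are already valid names and will not be rewritten.
valid <- identical(make.names(levels(y)), levels(y))
valid                                            # TRUE
```

Any other labels that are valid R names (for example "class0" and "class1") should work equally well.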

Upvotes: 1
