aaron

Reputation: 6489

Error when using NA levels for prediction in randomForest

All,

Consider the following example:

library(randomForest)

Y <- iris[, 1]
X <- iris[, 2:5]
X[seq(10, 150, 10), 4] <- NA  # introduce missing values in Species
X[, 4] <- addNA(X[, 4])       # make NA an explicit factor level
fit <- randomForest(X, Y)
predict(fit)                  # works fine
predict(fit, newdata = X)     # throws an error

Error in predict.randomForest(fit, newdata = X) : 
  Type of predictors in new data do not match that of the training data.

Even though NA is explicitly defined as a factor level, predict.randomForest still fails. Do I have any option other than manually recoding the NAs, given that addNA doesn't work the way I expected it to?

Cheers,

Aaron

Upvotes: 0

Views: 968

Answers (2)

aaron

Reputation: 6489

I couldn't find a way to predict on new data that contains NA factor levels added with addNA. If you want to treat missingness as a factor level in the data you predict on, what worked for me was to manually recode NA as the string "na" before converting the character vector to a factor. Performing this step in both the training and test phases gave me the result I was looking for.

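library(randomForest)

Y <- iris[, 1]
X <- iris[, 2:5]
X[seq(10, 150, 10), 4] <- NA
X[, 4] <- as.character(X[, 4])  # work on plain strings
X[is.na(X[, 4]), 4] <- 'na'     # recode missing values as an ordinary value
X[, 4] <- factor(X[, 4])        # 'na' is now a regular factor level
fit <- randomForest(X, Y)
predict(fit, newdata = X)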
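
If the recoding has to be repeated for separate training and test sets, it may help to wrap it in a small function. The following is only a sketch in base R: na_as_level is a hypothetical helper name (it is not part of randomForest), and the toy vectors stand in for a real factor column. It reuses the training levels for the new data, on the assumption that the new data should carry the same level set as the training data.

```r
# Hypothetical helper (not part of randomForest): recode NA as an ordinary
# string level, optionally forcing a given set of levels.
na_as_level <- function(x, levels = NULL, na_label = "na") {
  x <- as.character(x)
  x[is.na(x)] <- na_label
  if (is.null(levels)) {
    levels <- sort(unique(x))
  }
  # Values not listed in `levels` would become NA here.
  factor(x, levels = levels)
}

# Toy data standing in for a factor column with missing values.
train <- c("a", "b", NA, "c", "a")
new   <- c("b", NA, "b")

train_f <- na_as_level(train)
# Reuse the training levels so both factors have an identical level set,
# even though "a" and "c" do not occur in the new data.
new_f <- na_as_level(new, levels = levels(train_f))

levels(train_f)
levels(new_f)
```

The same call would be applied to the relevant column of X before randomForest() and again to the new data before predict().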

Upvotes: 0

MrFlick

Reputation: 206187

Well, generally, if you want predictions for the data you used to fit your model, you just call predict without the newdata= parameter. Does that work in this case?

But I'm assuming that's not what you really wanted to do, and that you did in fact want to predict on new data. Giving an example that works doesn't really help; we need a reproducible example of what doesn't work. After looking at this question (https://stats.stackexchange.com/questions/62015/prediction-with-randomforest-r-and-missing-values), though, it seems like it might be due to the NA values, as you suspected.

Upvotes: 1
