Reputation: 6615
Is there any way for R to 'gracefully' ignore errors that would normally crash prediction outright when there are new factor levels in the test set? Normally a single bad value makes the entire operation fail.
Ideally, predictions would be made wherever the values are valid, and only the rows with new factor levels would error.
Here is a crude example of what I'm getting at:
library(randomForest)

df = mtcars
df$vs = 99                # replace vs with a value unseen in training
df[1, 8] = 0              # vs column: row 1 keeps a valid value
df$vs = factor(df$vs)
mtcars$vs = factor(mtcars$vs)

fit = lm(mpg ~ ., data = mtcars)
# fit above works with the explanation given below, but fit2 fails with randomForest? why?
fit2 = randomForest(mpg ~ ., data = mtcars)
df$help = predict(fit, df)  # first row should work; the others should error gracefully, maybe with an NA?
The first response I got has been great. However, it still fails for the less simplistic randomForest example above.
Upvotes: 1
Views: 587
Reputation: 60472
You could use a tryCatch to return an NA when predicting.
For a single row:
tryCatch(predict(fit, bad_df[1, ]),
         error = function(e) NA)
For all rows:
sapply(1:nrow(bad_df),
       function(i)
         tryCatch(predict(fit, bad_df[i, ]),
                  error = function(e) NA))
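Putting this together with the mtcars setup from the question, here is a self-contained sketch (`bad_df` stands in for the question's `df`, and `train` is an assumed name for the training copy of mtcars):

```r
# Per-row tryCatch pattern: rows whose factor values match the training
# levels get predictions; rows with unseen values fall back to NA.
train = mtcars
train$vs = factor(train$vs)        # training levels "0", "1"
fit = lm(mpg ~ ., data = train)

bad_df = mtcars
bad_df$vs = 99                     # unseen value everywhere...
bad_df[1, "vs"] = 0                # ...except row 1
bad_df$vs = factor(bad_df$vs)      # levels "0", "99"

preds = sapply(1:nrow(bad_df),
               function(i)
                 tryCatch(predict(fit, bad_df[i, ]),
                          error = function(e) NA))

# row 1 gets a numeric prediction; the rows with the unseen level are NA
```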
An alternative is to change your data set. Basically, values of factors in your data set that don't match the levels stored in your fit object are set to NA:
for(i in 1:length(fit$xlevels)) {
  bad_values = which(!(bad_df[, names(fit$xlevels)[i]] %in% fit$xlevels[[i]]))
  bad_df[bad_values, names(fit$xlevels)[i]] = NA  # blank the offending rows, not columns
}
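As a sketch of that full workflow (assuming an lm fit, whose `xlevels` component records the factor levels seen in training): after the masking loop, `predict.lm` with its default `na.action = na.pass` returns NA for the masked rows.

```r
# Level-masking approach: unseen factor values become NA, then predict()
# propagates NA for those rows instead of erroring.
train = mtcars
train$vs = factor(train$vs)        # training levels "0", "1"
fit = lm(mpg ~ ., data = train)

bad_df = mtcars
bad_df$vs = 99                     # unseen value everywhere...
bad_df[1, "vs"] = 0                # ...except row 1
bad_df$vs = factor(bad_df$vs)      # levels "0", "99"

for(i in 1:length(fit$xlevels)) {
  v = names(fit$xlevels)[i]
  bad_values = which(!(bad_df[, v] %in% fit$xlevels[[i]]))
  bad_df[bad_values, v] = NA       # mask values absent from training levels
}

preds = predict(fit, bad_df)       # na.action = na.pass by default
# row 1 gets a numeric prediction; the masked rows come back NA
```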
Upvotes: 4