runningbirds

Reputation: 6615

Regression with new factor levels in test set - how to gracefully ignore error

Is there any way for R to 'gracefully' ignore the error that normally crashes the whole prediction when there are new factor levels in the test set? Normally, if there is just 1 bad value, the entire operation doesn't work.

That is, predictions would be made where there are valid values, and only the rows with new factor levels would error?

A really crappy example, but here is what I'm getting at:

  library(randomForest)
  df=mtcars
  df$vs=99
  df[1,8]=0  # column 8 is vs, so only the first row keeps a level seen in training
  df$vs=factor(df$vs)
  mtcars$vs=factor(mtcars$vs)

  fit=lm(mpg~., data=mtcars)
  # fit above works with the explanation given below, but fit2 fails with randomForest. Why?
  fit2 = randomForest(mpg~., data=mtcars)
  # the first row should work; the others should error gracefully, maybe with an NA?
  df$help=predict(fit, df)

The first response I got was great. However, it still fails for the less simplistic randomForest example above.

Upvotes: 1

Views: 587

Answers (1)

csgillespie

Reputation: 60472

You could use a tryCatch to return an NA when predicting.

For a single row:

tryCatch(predict(fit, bad_df[1,]), 
         error=function(e) NA)

For all rows:

sapply(1:nrow(bad_df), 
           function(i) 
               tryCatch(predict(fit, bad_df[i,]), 
                           error=function(e) NA))
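
As a self-contained sketch of the row-by-row approach, the snippet below rebuilds the question's data and runs it against the lm fit only. The names train and preds are introduced here for illustration, bad_df is assumed to be the question's df, and vapply is swapped in for sapply so the result is always a plain numeric vector:

```r
# Training data: vs has the levels "0" and "1"
train <- mtcars
train$vs <- factor(train$vs)

# Test data: only row 1 keeps a level seen in training, the rest get "99"
bad_df <- mtcars
bad_df$vs <- 99
bad_df$vs[1] <- 0
bad_df$vs <- factor(bad_df$vs)

fit <- lm(mpg ~ ., data = train)

# Predict one row at a time; a row that makes predict() fail yields NA
preds <- vapply(
  seq_len(nrow(bad_df)),
  function(i) {
    tryCatch(
      unname(predict(fit, bad_df[i, , drop = FALSE])),
      error = function(e) NA_real_
    )
  },
  numeric(1)
)
```

Note that this calls predict once per row, so it can be slow on large test sets, and it turns every error into NA, not just the new-levels one.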

An alternative is to change your data set. Basically, any factor values in your data set that don't match the levels stored in your fit object are set to NA before calling predict:

for(i in seq_along(fit$xlevels)) {
  col = names(fit$xlevels)[i]
  # rows whose value in this factor column was not seen when fitting
  bad_values = which(!(bad_df[, col] %in% fit$xlevels[[i]]))
  bad_df[bad_values, col] = NA
}
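
The same idea can be written without reading levels from the fit object, by aligning the test data to the training data directly. This is only a sketch: align_levels is a hypothetical helper (not part of any package), train and clean_df are names introduced here, and it is checked against lm only. How another model's predict method treats the resulting NA rows may differ.

```r
# Hypothetical helper: re-code each factor column of `newdata` with the
# levels found in `train`; values outside those levels become NA.
align_levels <- function(newdata, train) {
  for (nm in intersect(names(train), names(newdata))) {
    if (is.factor(train[[nm]])) {
      newdata[[nm]] <- factor(as.character(newdata[[nm]]),
                              levels = levels(train[[nm]]))
    }
  }
  newdata
}

# Same setup as in the question
train <- mtcars
train$vs <- factor(train$vs)
bad_df <- mtcars
bad_df$vs <- factor(c(0, rep(99, nrow(mtcars) - 1)))

fit <- lm(mpg ~ ., data = train)

# One vectorised predict() call; for lm, rows with an NA factor predict as NA
clean_df <- align_levels(bad_df, train)
preds <- predict(fit, clean_df)
```

Compared with the row-by-row tryCatch, this needs a single predict call and leaves other kinds of errors visible.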

Upvotes: 4
