StefanK

Reputation: 2180

Missing object in randomForest model when predicting on test dataset

Sorry if this has already been asked; I couldn't find it after half an hour of searching, so I would appreciate it if you could point me in the right direction.

I am getting an error about a missing object when predicting, even though that object is not used when building the model; it is merely present in the training dataset (as you can see in the example below).

This is a problem because I have already trained some rf models, which I load into the environment and reuse as they are. The test dataset lacks some variables that were present in the dataset the models were built on, but those variables are not used in the models themselves!

library(randomForest)
data(iris)

smp_size <- floor(0.75*nrow(iris))
set.seed(123)
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)

train <- iris[train_ind, ]
test <- iris[-train_ind, ]

test$Sepal.Length <- NULL  # for the sake of example I drop this column

rf_model <- randomForest(Species ~ . - Sepal.Length, # the column is excluded from the model
                         data = train)

rf_prediction <- predict(rf_model, newdata = test)

When I try to predict on the test dataset, I get an error:

Error in eval(expr, envir, enclos) : object 'Sepal.Length' not found

What I hope to achieve is to use the models I have already built, since rebuilding them without the missing variables would be costly.

Thanks for advice!

Upvotes: 0

Views: 497

Answers (1)

Ian Wesley

Reputation: 3624

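Since your models are already built, you will want to add the missing columns back to the test set before predicting. Just add the missing columns with a value of 0, as in the following example. The value itself should not matter, because the column is not among the model's predictors; it only needs to exist.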

library(randomForest)
data(iris)

smp_size <- floor(0.75*nrow(iris))
set.seed(123)
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)

train <- iris[train_ind, ]
test <- iris[-train_ind, ]

test$Sepal.Length <- NULL  

rf_model <- randomForest(Species ~ . - Sepal.Length, 
                         data = train)

# Add the columns that are missing from the test set, filled with a placeholder value.
missingColumns <- setdiff(colnames(train), colnames(test))
test[, missingColumns] <- 0

rf_prediction <- predict(rf_model, newdata = test)

rf_prediction

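# Compare with a model trained without the column at all. The predictions should
# largely agree; the two forests are grown from different random states, so they
# need not be identical.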
train2 <- iris[train_ind, ]
test2 <- iris[-train_ind, ]

test2$Sepal.Length <- NULL  
train2$Sepal.Length <- NULL  

rf_model2 <- randomForest(Species ~ ., 
                         data = train2)


rf_prediction2 <- predict(rf_model2, newdata = test2)

rf_prediction2 == rf_prediction 
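
To see why the placeholder column is needed, and to pad only what the stored formula actually asks for, here is a minimal base R sketch. It assumes the fitted model keeps its formula terms (for the formula interface of randomForest this should be `rf_model$terms`; the sketch builds an equivalent terms object by hand so it runs without the package). The helper name `pad_missing_columns` and its `fill` argument are made up for illustration.

```r
# Sketch only: base R, no randomForest needed.
# Build the same kind of terms object that a formula-based fit would store.
tt <- terms(Species ~ . - Sepal.Length, data = iris)

# Variables that actually enter the model as predictors.
predictors <- attr(tt, "term.labels")

# Variables the formula still references, and that model.frame() will
# therefore look up in newdata, even though some were subtracted.
referenced <- all.vars(delete.response(tt))

# Referenced but unused: these are the columns that only need to exist.
unused <- setdiff(referenced, predictors)

# Hypothetical helper: add any referenced-but-absent column, filled with a
# non-NA placeholder (NA might cause rows to be dropped by na.omit).
pad_missing_columns <- function(newdata, needed, fill = 0) {
  for (col in setdiff(needed, names(newdata))) {
    newdata[[col]] <- rep(fill, nrow(newdata))
  }
  newdata
}

# Test set without the subtracted column, as in the question.
test <- iris[, setdiff(names(iris), "Sepal.Length")]

# Without padding, building the model frame fails on the missing variable.
unpadded_result <- tryCatch(
  model.frame(delete.response(tt), test),
  error = function(e) conditionMessage(e)
)

# With padding, the model frame can be built for every row.
padded <- pad_missing_columns(test, referenced)
mf <- model.frame(delete.response(tt), padded)
```

With a real model, something like `pad_missing_columns(test, all.vars(delete.response(rf_model$terms)))` followed by `predict(rf_model, newdata = ...)` would be the analogous call, under the assumption stated above. For non-numeric columns a placeholder of the matching type may be a safer choice than 0.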

Upvotes: 1
