How to auto-exclude unseen new factor levels in predict.randomForest?

Question

I am using the randomForest package to create a random forest model. My data sets are huge, with more than a million observations of 200+ variables. When I train the random forest on sample data, I am not able to capture all factor levels of all variables.

So when I predict on the validation set with predict(), it throws an error because the validation data contains factor levels that were not captured in the training data.

One solution is to ensure that training data variables contain all factor levels. But this is turning out to be very tedious and I don't really need all factor levels.

Is there a way to auto-exclude observations from the validation set that contain previously unseen factor levels when running predict() in the randomForest package? I could not find any argument for that in the CRAN documentation. I don't think I can make a reproducible example for this one.

Amrita Sawant · Accepted Answer

One solution is to combine the Train and Test data, call as.factor on the combined data, and then split it back into Train and Test. I faced this same issue with random forest and this solution worked for me.

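For example:

   n_train <- nrow(Train)
   combine <- rbind(Train, Test)
   combine$var1 <- as.factor(combine$var1)

   ## Then split back into Train and Test (rows keep their original order)
   Train <- combine[1:n_train, ]
   Test  <- combine[(n_train + 1):nrow(combine), ]

Both data frames now share the same set of levels for var1.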
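If you would rather drop the validation rows with unseen levels, as the question asks, a minimal base-R sketch could look like the one below. I am not aware of a predict() argument that does this, so the filtering happens before the call. The toy data and the helper name seen_in_train are made up for illustration, and rf_model stands for an already fitted randomForest object.

```r
# Hypothetical toy data; Train, Test and var1 follow the naming used above.
Train <- data.frame(var1 = factor(c("a", "b", "a", "c")), y = c(1, 2, 3, 4))
Test  <- data.frame(var1 = factor(c("a", "d", "c", "e")), y = c(5, 6, 7, 8))

# Illustrative helper (not part of randomForest): returns TRUE for rows of
# `test` whose values in every factor column of `train` occur among the
# training levels. Rows with NA in a factor column come out as FALSE.
seen_in_train <- function(train, test) {
  factor_cols <- names(train)[vapply(train, is.factor, logical(1))]
  keep <- rep(TRUE, nrow(test))
  for (col in factor_cols) {
    keep <- keep & (as.character(test[[col]]) %in% levels(train[[col]]))
  }
  keep
}

keep <- seen_in_train(Train, Test)
Test_known <- Test[keep, , drop = FALSE]

# Re-level the remaining rows so each factor column carries exactly the
# training levels, in the same order.
for (col in names(Train)[vapply(Train, is.factor, logical(1))]) {
  Test_known[[col]] <- factor(Test_known[[col]], levels = levels(Train[[col]]))
}

# Assumed next step, with rf_model being a randomForest fitted on Train:
# pred <- predict(rf_model, newdata = Test_known)
```

Keeping the logical vector keep around lets you see afterwards which validation rows were excluded and how many predictions are missing.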

Hope this helps!
