Reputation: 28139
I'm working with a data set that has a lot of NA's. I know that the first 6 columns do NOT have any NA's. Since the first column is an ID column I'm omitting it.
I run the following code to select only lines that have values in the response column:
sub1 <- TrainingData[which(!is.na(TrainingData[,70])),]
I then use sub1 as the data set in a randomForest using this code:
set.seed(448)
RF <- randomForest(sub1[,c(2:6)], sub1[,70]
,do.trace=TRUE,importance=TRUE,ntree=10,,forest=TRUE)
then I run this code to check the output for NA's:
> length(which(is.na(RF$predicted)))
[1] 65
I can't figure out why I'd be getting NA's if the data going in is clean.
Any suggestions?
Upvotes: 2
Views: 2772
Reputation: 4123
I think you should use more trees. Because predicted
values are preditions for the out-of-bag set. And if number of trees very small some cases are never present in out-of-bag set, because this set forms randomly.
Upvotes: 5