Reputation: 31
I am exploring the randomForest() function in R. Several articles I found suggest logic similar to the call below, where the response variable is column 30 and the independent variables are everything except column 30:
dat.rf <- randomForest(dat[,-30],
dat[,30],
proximity=TRUE,
mtry=3,
importance=TRUE,
do.trace=100,
na.action = na.omit)
When I try this, I get the following error and warning:
Error in randomForest.default(dat[, -30], dat[, 30], proximity = TRUE, :
NA not permitted in predictors
In addition: Warning message:
In randomForest.default(dat[, -30], dat[, 30], proximity = TRUE, :
The response has five or fewer unique values. Are you sure you want to do regression?
However, I was able to get it to work when I listed the independent variables one by one while keeping all the other parameters the same.
dat.rf <- randomForest(as.factor(Y) ~X1+ X2+ X3+ X4+ X5+ X6+ X7+ X8+ X9+ X10+......,
data=dat,
proximity=TRUE,
mtry=3,
importance=TRUE,
do.trace=100,
na.action = na.omit)
Could someone help me debug the simpler command, where I don't have to list each predictor one by one?
Upvotes: 3
Views: 2616
Reputation: 5056
The error message gives you a clue to two problems:
First, you can't have NA anywhere in the predictors. Removing NA should be easy enough and I'll leave you that one as an exercise.
Second, because the response is not a factor, randomForest() will automatically apply regression.
So, how do you force randomForest() to use classification?
As you noticed in your first try, randomForest() allows you to give data as predictors and response data, not just using the formula style. To force randomForest() to apply classification, make sure that the value you are trying to predict (the response, or dat[,30]) is a factor. Remember to explicitly identify the x and y arguments. This is easy to do:
randomForest(x = dat[,-30],
y = factor(dat[,30]),
...)
This way your output can only take one of the levels given in y.
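Putting the two fixes together, a minimal sketch might look like the following. The toy data frame dat, its column names, and the use of complete.cases() to drop incomplete rows are my assumptions for illustration, not something taken from the question. The response sits in column 3 here rather than column 30, and the randomForest() call is skipped if the package is not installed.

```r
# Toy stand-in for the asker's data (hypothetical):
# two predictors, response in column 3.
set.seed(1)
dat <- data.frame(
  X1 = c(rnorm(19), NA),         # one missing predictor value
  X2 = runif(20),
  Y  = rep(c(0, 1), times = 10)  # numeric 0/1 response
)
resp_col <- 3  # would be 30 in the original question

# Fix 1: drop rows that contain NA (one possible approach; imputation is another).
dat_clean <- dat[complete.cases(dat), ]

# Fix 2: pass the response as a factor so classification is used.
x <- dat_clean[, -resp_col]
y <- factor(dat_clean[, resp_col])

# Only fit if the randomForest package is available.
if (requireNamespace("randomForest", quietly = TRUE)) {
  dat.rf <- randomForest::randomForest(x = x, y = y,
                                       mtry = 1, importance = TRUE)
  print(dat.rf$type)  # reports whether the forest was fit as classification or regression
}
```

Dropping rows is the bluntest way to deal with NA; whether it is acceptable depends on how much data you lose.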
This is all buried in the description of the arguments x and y: see ?randomForest.
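If you would rather stay with the formula style from your second attempt, R formulas have a shorthand that avoids listing predictors: a . on the right-hand side stands for every column of data that is not already used in the formula. A rough sketch, assuming a toy data frame whose names (dat, X1, X2, Y) are made up for illustration. Converting Y to a factor beforehand keeps the formula simple, and na.action = na.omit is kept as in your working call.

```r
# Hypothetical toy data: two predictors and a two-level response.
set.seed(1)
dat <- data.frame(
  X1 = c(rnorm(19), NA),
  X2 = runif(20),
  Y  = rep(c("a", "b"), times = 10)
)
dat$Y <- factor(dat$Y)  # factor response, so classification is used

# Inspect what `.` expands to for this data frame.
predictors <- attr(terms(Y ~ ., data = dat), "term.labels")
print(predictors)  # "X1" "X2"

# Only fit if the randomForest package is available.
if (requireNamespace("randomForest", quietly = TRUE)) {
  dat.rf <- randomForest::randomForest(Y ~ ., data = dat,
                                       mtry = 1, na.action = na.omit)
}
```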
Upvotes: 4