user3521568

Reputation: 31

randomForest() machine learning in R

I am exploring the randomForest() function in R, and several articles I found suggest logic similar to the code below, where the response variable is column 30 and the independent variables are every column except column 30:

dat.rf <- randomForest(dat[,-30], 
                      dat[,30], 
                      proximity=TRUE, 
                      mtry=3,
                      importance=TRUE,
                      do.trace=100,
                      na.action = na.omit)

When I try this, I get the following error message and warning:

Error in randomForest.default(dat[, -30], dat[, 30], proximity = TRUE, :
  NA not permitted in predictors
In addition: Warning message:
In randomForest.default(dat[, -30], dat[, 30], proximity = TRUE, :
  The response has five or fewer unique values. Are you sure you want to do regression?

However, I was able to get it to work when I listed the independent variables one by one while keeping all the other parameters the same.

dat.rf <- randomForest(as.factor(Y) ~ X1+ X2+ X3+ X4+ X5+ X6+ X7+ X8+ X9+ X10+......,
                      data=dat,
                      proximity=TRUE,
                      mtry=3,
                      importance=TRUE,
                      do.trace=100,
                      na.action = na.omit)

Could someone help me debug the simpler command, where I don't have to list each predictor one by one?

Upvotes: 3

Views: 2616

Answers (1)

Andy Clifton

Reputation: 5056

The error message gives you a clue to two problems:

  1. First, you need to remove any row that has an NA anywhere. Removing NAs should be easy enough and I'll leave you that one as an exercise.
  2. It looks like you need to do classification (which predicts a response that takes one of a few discrete levels), rather than regression (which predicts a continuous response). If the response is numeric rather than a factor, randomForest() will automatically apply regression, which is what the warning is telling you.
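For the first point, here is a minimal sketch of dropping incomplete rows before calling randomForest(). The small data frame is a made-up stand-in for dat, and its column names are only for illustration. As far as I can tell, na.action is only used by the formula interface of randomForest(), which would explain why na.omit had no effect in the x/y call.

```r
# Toy data frame standing in for `dat` (hypothetical values).
dat <- data.frame(
  X1 = c(1.2, NA, 3.1, 4.8, 5.0),
  X2 = c(10, 20, 30, NA, 50),
  Y  = c(0, 1, 0, 1, 1)
)

# Keep only the rows that have no NA in any column.
dat_complete <- dat[complete.cases(dat), ]

# na.omit() keeps the same rows when applied to a data frame.
dat_omitted <- na.omit(dat)

nrow(dat)           # 5
nrow(dat_complete)  # 3 (rows 2 and 4 contained an NA)
```

Either dat_complete or dat_omitted can then be passed on in place of dat.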

So, how do you force randomForest() to use classification? As you noticed in your first try, randomForest() lets you pass the predictors and the response as separate objects instead of using the formula style. To force randomForest() to apply classification, make sure that the value you are trying to predict (the response, or dat[,30]) is a factor. Remember to explicitly name the x and y arguments. This is easy to do:

 randomForest(x = dat[,-30],
              y = factor(dat[,30]),
              ...)

This way your output can only take one of the levels given in y.
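To see the effect end to end, here is a small self-contained sketch. The simulated data frame, its column names, and the choice of mtry = 2 are all assumptions for illustration; in your case the response sits in column 30 rather than the last of four columns.

```r
library(randomForest)

set.seed(42)  # make the toy data reproducible

# Hypothetical stand-in for `dat`: three predictors plus a 0/1 response
# stored as numbers in the last column.
n <- 100
dat <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
dat$Y <- as.integer(dat$X1 + dat$X2 > 0)

resp_col <- ncol(dat)  # plays the role of column 30 in the question

# Wrapping the response in factor() makes randomForest() do classification.
rf <- randomForest(x = dat[, -resp_col],
                   y = factor(dat[, resp_col]),
                   mtry = 2,
                   importance = TRUE)

rf$type  # "classification"
```

If you pass dat[, resp_col] without factor(), the same call fits a regression forest and you get the "five or fewer unique values" warning again.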

This is all buried in the description of the arguments x and y: see ?randomForest.
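If you would rather stay with the formula style, R's formula syntax lets you write a dot for "every other column", so you do not have to list the predictors. This is only a sketch on simulated data with made-up column names. I convert the response to a factor beforehand instead of writing as.factor(Y) ~ . because I am not sure the dot would leave Y out of the predictors when the left-hand side is a function call.

```r
library(randomForest)

set.seed(1)  # make the toy data reproducible

# Hypothetical stand-in for `dat`, with a couple of missing predictor values.
n <- 100
dat <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
dat$Y <- as.integer(dat$X1 - dat$X3 > 0)
dat$X2[c(5, 17)] <- NA

# Make the response a factor up front so classification is used.
dat$Y <- factor(dat$Y)

# `Y ~ .` means "Y explained by all remaining columns of dat".
# With the formula interface, na.action = na.omit drops the incomplete rows.
rf_formula <- randomForest(Y ~ ., data = dat,
                           na.action = na.omit,
                           mtry = 2)

rf_formula$type  # "classification"
```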

Upvotes: 4
