Stéphane Laurent

Reputation: 84529

Tuning of mtry by caret returning strange value

I tune the mtry parameter of randomForest using the train function from the caret package. My X data has only 48 columns, yet train returns mtry=50 as the best value, which is not a valid value (>48). What is the explanation for that?

> dim(X)
[1] 93 48
> fit <- train(level~., data=data.frame(X,level), tuneLength=13) 
> fit$finalModel

Call:
 randomForest(x = x, y = y, mtry = param$mtry) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 50

        OOB estimate of  error rate: 2.15%
Confusion matrix:
     high low class.error
high   81   1  0.01219512
low     1  10  0.09090909

It is even worse if I don't set the tuneLength parameter:

> fit <- train(level~., data=data.frame(X,level)) 
> fit$finalModel 

Call:
 randomForest(x = x, y = y, mtry = param$mtry) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 55

        OOB estimate of  error rate: 2.15%
Confusion matrix:
     high low class.error
high   81   1  0.01219512
low     1  10  0.09090909

I can't provide the data because it is confidential. But there is nothing special about these data: each column is numeric or a factor, and there are no missing values.

Upvotes: 3

Views: 2880

Answers (1)

topepo

Reputation: 14316

The apparent discrepancy is most likely[1] between the number of columns in your data set and the number of predictors, which may not be the same if any of the columns are factors. You used the formula method, which expands the factors into dummy variables. For example:

> head(model.matrix(Sepal.Width ~ ., data = iris))
  (Intercept) Sepal.Length Petal.Length Petal.Width Speciesversicolor Speciesvirginica
1           1          5.1          1.4         0.2                 0                0
2           1          4.9          1.4         0.2                 0                0
3           1          4.7          1.3         0.2                 0                0
4           1          4.6          1.5         0.2                 0                0
5           1          5.0          1.4         0.2                 0                0
6           1          5.4          1.7         0.4                 0                0

So there are 3 predictor columns in iris (besides the Sepal.Width outcome) but you end up with 5 (non-intercept) predictor columns, because Species is expanded into two dummy variables.
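If you want mtry to be tuned over the original 48 columns, you can pass x and y directly instead of a formula, which skips the model.matrix expansion and lets randomForest handle factors natively. A sketch, assuming X and level are your predictor data frame and outcome from the question:

```r
library(caret)

## The non-formula interface: no dummy-variable expansion,
## so the candidate mtry values stay within ncol(X) = 48.
fit <- train(x = X, y = level, method = "rf", tuneLength = 13)
fit$finalModel
```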

Max

[1] This is why you need to provide a reproducible example. Often, when I get ready to ask a question, the answer becomes apparent while I take the time to write a good description of the issue.

Upvotes: 6
