Reputation: 84529
I tune the mtry parameter of randomForest using the train function from the caret package. There are only 48 columns in my X data, yet train returns mtry=50 as the best value, which is not a valid value (>48). What is the explanation for that?
> dim(X)
[1] 93 48
> fit <- train(level~., data=data.frame(X,level), tuneLength=13)
> fit$finalModel
Call:
randomForest(x = x, y = y, mtry = param$mtry)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 50
OOB estimate of error rate: 2.15%
Confusion matrix:
high low class.error
high 81 1 0.01219512
low 1 10 0.09090909
It is even worse if I don't set the tuneLength parameter:
> fit <- train(level~., data=data.frame(X,level))
> fit$finalModel
Call:
randomForest(x = x, y = y, mtry = param$mtry)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 55
OOB estimate of error rate: 2.15%
Confusion matrix:
high low class.error
high 81 1 0.01219512
low 1 10 0.09090909
I can't provide the data because it is confidential, but there is nothing special about it: each column is either numeric or a factor, and there are no missing values.
Upvotes: 3
Views: 2880
Reputation: 14316
The apparent discrepancy is most likely[1] between the number of columns in your data set and the number of predictors, which may not be the same if any of the columns are factors. You used the formula method, which will expand the factors into dummy variables. For example:
> head(model.matrix(Sepal.Width ~ ., data = iris))
(Intercept) Sepal.Length Petal.Length Petal.Width Speciesversicolor Speciesvirginica
1 1 5.1 1.4 0.2 0 0
2 1 4.9 1.4 0.2 0 0
3 1 4.7 1.3 0.2 0 0
4 1 4.6 1.5 0.2 0 0
5 1 5.0 1.4 0.2 0 0
6 1 5.4 1.7 0.4 0 0
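So there are 4 predictor columns in iris (Sepal.Length, Petal.Length, Petal.Width and the factor Species), but you end up with 5 (non-intercept) predictors, because the three-level factor Species becomes two dummy variables.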
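To check the same thing on your own data, you can compare the column count with the width of the expanded model matrix. The following is a minimal sketch in base R using iris as a stand-in for the confidential data; the variable names (predictors, n_columns, n_expanded) are illustrative only.

```r
# Predictor columns as stored in the data frame (outcome excluded).
predictors <- iris[, setdiff(names(iris), "Sepal.Width")]
n_columns <- ncol(predictors)

# Predictors after the formula interface expands factors into dummies.
mm <- model.matrix(Sepal.Width ~ ., data = iris)
n_expanded <- ncol(mm) - 1  # subtract the intercept column

cat("columns in data:", n_columns, "\n")
cat("predictors after dummy expansion:", n_expanded, "\n")
```

If the second number exceeds the first on your data, the mtry values above 48 are explained by the dummy variables. If you want mtry to be bounded by the number of original columns, one option is the non-formula interface, train(x = X, y = level, ...), which passes factors through without expanding them.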
Max
[1] This is why you need to provide a reproducible example. Often, when I get ready to ask a question, the answer becomes apparent while I take the time to write a good description of the issue.
Upvotes: 6