I'm trying to understand the difference between the random forest implementation in the randomForest package and in the caret package. For example, this fits 2000 trees with mtry = 2 in randomForest and shows the Gini importance (MeanDecreaseGini) for each predictor:
library(randomForest)
library(dplyr)  # %>% and rename() come from dplyr, not tidyr

rf1 <- randomForest(Species ~ ., data = iris,
                    ntree = 2000, mtry = 2,
                    importance = TRUE)
data.frame(RF = sort(importance(rf1)[, "MeanDecreaseGini"], decreasing = TRUE)) %>%
  tibble::rownames_to_column("Predictor")  # add_rownames() is deprecated
# Predictor RF
# 1 Petal.Width 45.57974
# 2 Petal.Length 41.61171
# 3 Sepal.Length 9.59369
# 4 Sepal.Width 2.47010
I'm trying to get the same information in caret, but I don't know how to specify the number of trees or how to get the Gini importance:
library(caret)
rf2 <- train(Species ~ ., data = iris, method = "rf",
             metric = "Kappa",
             tuneGrid = data.frame(mtry = 2))
varImp(rf2)  # not the Gini importance
# Overall
# Petal.Length 100.000
# Petal.Width 99.307
# Sepal.Width 0.431
# Sepal.Length   0.000
Also, the confusion matrix of rf1 has some errors while that of rf2 doesn't. What parameter is causing this difference?
# rf1 Confusion matrix:
# setosa versicolor virginica class.error
# setosa 50 0 0 0.00
# versicolor 0 47 3 0.06
# virginica 0 4 46 0.08
table(predict(rf2, iris), iris$Species)
# setosa versicolor virginica
# setosa 50 0 0
# versicolor 0 50 0
# virginica 0 0 50
This is quick and dirty; I know this isn't the right way to test the performance of the classifier, but I don't understand the difference in the results.
I was also recently looking for a way to get the MeanDecreaseGini variable importance from the caret implementation of randomForest. I realize this was posted long ago, so perhaps caret has been updated and my advice is no longer necessary, but I struggled to find a solution, so hopefully someone finds this useful.
To set the number of trees in caret, pass the ntree argument during training just like you would with randomForest (note the name is ntree, not ntrees; train() forwards it to randomForest). Then, to output the MeanDecreaseGini, call varImp() with type = 2 (1 = MeanDecreaseAccuracy [default], 2 = MeanDecreaseGini) and scale = FALSE. Full code with results is below. (Across several runs there are minor fluctuations in the magnitude of the results, which I suspect is random chance, but the rank of the variables is consistent.)
library(randomForest)
library(caret)

## randomForest
rf1 <- randomForest(Species ~ ., data = iris,
                    ntree = 2000, mtry = 2,
                    importance = TRUE)
data.frame(Gini = sort(importance(rf1, type = 2)[, 1], decreasing = TRUE))
# Gini
# Petal.Width 43.924705
# Petal.Length 43.293731
# Sepal.Length 9.717544
# Sepal.Width 2.320682
## caret
rf2 <- train(Species ~ .,
             data = iris,
             method = "rf",
             ntree = 2000,       ## same as randomForest (forwarded by train)
             importance = TRUE,  ## same as randomForest
             metric = "Kappa",
             tuneGrid = data.frame(mtry = 2),
             trControl = trainControl(method = "none")) ## stop the default 25 bootstrap resamples
varImp(rf2, type = 2, scale = FALSE)
# rf variable importance
#
# Overall
# Petal.Width 44.475
# Petal.Length 43.401
# Sepal.Length 9.140
# Sepal.Width 2.267
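To confirm the argument actually reached the underlying model, you can inspect the fitted randomForest object that caret stores in the train object; a quick sanity check, assuming rf2 was fit as above:
rf2$finalModel$ntree  # 2000 if ntree was passed through
rf2$finalModel$mtry   # 2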
As for the confusion matrix confusion (confusing phrasing?), this seems to be a byproduct of the way you were calculating the confusion matrices. When I used the predict function on the training data for both models, I got 100% accuracy, whereas the confusion matrices stored in the model objects still show errors:
rf1$confusion
# setosa versicolor virginica class.error
# setosa 50 0 0 0.00
# versicolor 0 47 3 0.06
# virginica 0 3 47 0.06
table(predict(rf1, iris), iris$Species)
# setosa versicolor virginica
# setosa 50 0 0
# versicolor 0 50 0
# virginica 0 0 50
rf2$finalModel$confusion
# setosa versicolor virginica class.error
# setosa 50 0 0 0.00
# versicolor 0 47 3 0.06
# virginica 0 5 45 0.10
table(predict(rf2, iris), iris$Species)
# setosa versicolor virginica
# setosa 50 0 0
# versicolor 0 50 0
# virginica 0 0 50
However, note that rf1$confusion and rf2$finalModel$confusion do not represent predictions on the full training data; they are built from the out-of-bag predictions, which is why they show errors while the in-sample tables above do not.
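One way to check this, assuming rf1 was fit as above: calling predict() on a randomForest object without newdata returns the out-of-bag predictions, and tabulating those reproduces the error pattern of rf1$confusion rather than the perfect in-sample table:
table(predict(rf1), iris$Species)        # OOB predictions: shows errors like rf1$confusion
table(predict(rf1, iris), iris$Species)  # in-sample predictions: perfect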
This might help to answer the question; see the 2nd post:
caret: using random forest and include cross-validation
randomForest already samples with replacement internally (a bootstrap) and reports the out-of-bag error. If you use "rf" in caret, you need to specify trControl in caret::train(); to have caret use that same estimate rather than its own external resampling, set trControl = trainControl(method = "oob"). trainControl defines how train() evaluates the model; the method can instead be "cv" for cross-validation, "repeatedcv" for repeated cross-validation, and so on. See the caret package documentation for more info.
You should get the same result as using randomForest, but do remember to set the seeds properly.
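A minimal sketch of what that call looks like, assuming the iris setup from the question (the seed value is arbitrary):
library(caret)
set.seed(123)  # arbitrary; fix the seed so runs are reproducible
rf_oob <- train(Species ~ ., data = iris, method = "rf",
                ntree = 2000,                              # forwarded to randomForest
                tuneGrid = data.frame(mtry = 2),
                trControl = trainControl(method = "oob"))  # use the OOB error estimate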