I'm trying to understand the difference between the random forest implementation in the randomForest package and in the caret package. For example, this fits 2000 trees with mtry = 2 in randomForest and shows the Gini importance (MeanDecreaseGini) for each predictor:
library(randomForest)
library(dplyr)  # %>% and rename() come from dplyr, not tidyr

rf1 <- randomForest(Species ~ ., data = iris,
                    ntree = 2000, mtry = 2,
                    importance = TRUE)
data.frame(RF = sort(importance(rf1)[, "MeanDecreaseGini"], decreasing = TRUE)) %>%
  tibble::rownames_to_column("Predictor")  # add_rownames() is deprecated
# Predictor RF
# 1 Petal.Width 45.57974
# 2 Petal.Length 41.61171
# 3 Sepal.Length 9.59369
# 4 Sepal.Width 2.47010
I'm trying to get the same information in caret, but I don't know how to specify the number of trees or how to get the Gini importance:
library(caret)
rf2 <- train(Species ~ ., data = iris, method = "rf",
             metric = "Kappa",
             tuneGrid = data.frame(mtry = 2))
varImp(rf2)  # not the Gini importance
# Overall
# Petal.Length 100.000
# Petal.Width 99.307
# Sepal.Width 0.431
# Sepal.Length   0.000
Also, the confusion matrix of rf1 has some errors while that of rf2 doesn't. What parameter is causing this difference?
# rf1 Confusion matrix:
# setosa versicolor virginica class.error
# setosa 50 0 0 0.00
# versicolor 0 47 3 0.06
# virginica 0 4 46 0.08
table(predict(rf2, iris), iris$Species)
# setosa versicolor virginica
# setosa 50 0 0
# versicolor 0 50 0
# virginica 0 0 50
This is quick and dirty; I know this isn't the right way to test the performance of the classifier, but I don't understand the difference in the results.
I was also recently looking for a way to get the MeanDecreaseGini variable importance from the caret implementation of randomForest. I realize this was posted long ago, so perhaps caret has been updated and my advice is no longer necessary, but I struggled to find a solution, so hopefully someone finds this useful.
To set the number of trees in caret, pass the ntree argument during training just like you would with randomForest (note the name is ntree, not ntrees; train() forwards it to randomForest). Then, to output the MeanDecreaseGini, call varImp() with type = 2 (1 = MeanDecreaseAccuracy [default], 2 = MeanDecreaseGini) and scale = FALSE. Full code with results is below. (Across several runs there are minor fluctuations in the magnitude of the results, which I suspect is random chance, but the rank of the variables is consistent.)
library(randomForest)
library(caret)

## randomForest
rf1 <- randomForest(Species ~ ., data = iris,
                    ntree = 2000, mtry = 2,
                    importance = TRUE)
data.frame(Gini = sort(importance(rf1, type = 2)[, 1], decreasing = TRUE))
# Gini
# Petal.Width 43.924705
# Petal.Length 43.293731
# Sepal.Length 9.717544
# Sepal.Width 2.320682
## caret
rf2 <- train(Species ~ .,
             data = iris,
             method = "rf",
             ntree = 2000,       ## same as randomForest (forwarded by train)
             importance = TRUE,  ## same as randomForest
             metric = "Kappa",
             tuneGrid = data.frame(mtry = 2),
             trControl = trainControl(method = "none")) ## stop the default 25 bootstrap resamples
varImp(rf2, type = 2, scale = FALSE)
# rf variable importance
#
# Overall
# Petal.Width 44.475
# Petal.Length 43.401
# Sepal.Length 9.140
# Sepal.Width 2.267
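To confirm the argument actually reached the underlying model, you can inspect the fitted randomForest object that caret stores in the train object; a quick sanity check, assuming rf2 was fit as above:
rf2$finalModel$ntree  # 2000 if ntree was passed through
rf2$finalModel$mtry   # 2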
As for the confusion matrix confusion (confusing phrasing?), this seems to be a byproduct of the way you were calculating the confusion matrices. When I used the predict function on the training data for both models, I got 100% accuracy, whereas the confusion matrices stored in the model objects still show errors:
rf1$confusion
# setosa versicolor virginica class.error
# setosa 50 0 0 0.00
# versicolor 0 47 3 0.06
# virginica 0 3 47 0.06
table(predict(rf1, iris), iris$Species)
# setosa versicolor virginica
# setosa 50 0 0
# versicolor 0 50 0
# virginica 0 0 50
rf2$finalModel$confusion
# setosa versicolor virginica class.error
# setosa 50 0 0 0.00
# versicolor 0 47 3 0.06
# virginica 0 5 45 0.10
table(predict(rf2, iris), iris$Species)
# setosa versicolor virginica
# setosa 50 0 0
# versicolor 0 50 0
# virginica 0 0 50
However, note that rf1$confusion and rf2$finalModel$confusion do not represent predictions on the full training data; they are built from the out-of-bag predictions, which is why they show errors while the in-sample tables above do not.
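One way to check this, assuming rf1 was fit as above: calling predict() on a randomForest object without newdata returns the out-of-bag predictions, and tabulating those reproduces the error pattern of rf1$confusion rather than the perfect in-sample table:
table(predict(rf1), iris$Species)        # OOB predictions: shows errors like rf1$confusion
table(predict(rf1, iris), iris$Species)  # in-sample predictions: perfect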
This might help to answer the question; see the 2nd post:
caret: using random forest and include cross-validation
randomForest already samples with replacement internally (a bootstrap) and reports the out-of-bag error. If you use "rf" in caret, you need to specify trControl in caret::train(); to have caret use that same estimate rather than its own external resampling, set trControl = trainControl(method = "oob"). trainControl defines how train() evaluates the model; the method can instead be "cv" for cross-validation, "repeatedcv" for repeated cross-validation, and so on. See the caret package documentation for more info.
You should get the same result as using randomForest, but do remember to set the seeds properly.
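A minimal sketch of what that call looks like, assuming the iris setup from the question (the seed value is arbitrary):
library(caret)
set.seed(123)  # arbitrary; fix the seed so runs are reproducible
rf_oob <- train(Species ~ ., data = iris, method = "rf",
                ntree = 2000,                              # forwarded to randomForest
                tuneGrid = data.frame(mtry = 2),
                trControl = trainControl(method = "oob"))  # use the OOB error estimate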