Rafa OR

Reputation: 349

Difference between varImp (caret) and importance (randomForest) for Random Forest

I do not understand what the difference is between the varImp function (caret package) and the importance function (randomForest package) for a Random Forest model:

I fitted a simple RF classification model, and when computing variable importance I found that the "ranking" of predictors was not the same for the two functions.

Here is my code:

library(randomForest)
library(caret)

rfImp <- randomForest(Origin ~ ., data = TAll_CS,
                      ntree = 2000,
                      importance = TRUE)

importance(rfImp)

                                 BREAST       LUNG MeanDecreaseAccuracy MeanDecreaseGini
Energy_GLCM_R1SC4NG3        -1.44116806  2.8918537            1.0929302        0.3712622
Contrast_GLCM_R1SC4NG3      -2.61146974  1.5848150           -0.4455327        0.2446930
Entropy_GLCM_R1SC4NG3       -3.42017102  3.8839464            0.9779201        0.4170445
...

varImp(rfImp)
                                 BREAST        LUNG
Energy_GLCM_R1SC4NG3         0.72534283  0.72534283
Contrast_GLCM_R1SC4NG3      -0.51332737 -0.51332737
Entropy_GLCM_R1SC4NG3        0.23188771  0.23188771
...

I thought they used the same "algorithm", but now I am not sure.

EDIT

In order to reproduce the problem, the ionosphere dataset (kknn package) can be used:

library(kknn)
library(randomForest)
library(caret)
data(ionosphere)
rfImp <- randomForest(class ~ ., data = ionosphere[,3:35],
                       ntree = 2000,
                       importance = TRUE)
importance(rfImp)
             b        g MeanDecreaseAccuracy MeanDecreaseGini
V3  21.3106205 42.23040             42.16524        15.770711
V4  10.9819574 28.55418             29.28955         6.431929
V5  30.8473944 44.99180             46.64411        22.868543
V6  11.1880372 33.01009             33.18346         6.999027
V7  13.3511887 32.22212             32.66688        14.100210
V8  11.8883317 32.41844             33.03005         7.243705
V9  -0.5020035 19.69505             19.54399         2.501567
V10 -2.9051578 22.24136             20.91442         2.953552
V11 -3.9585608 14.68528             14.11102         1.217768
V12  0.8254453 21.17199             20.75337         3.298964
...

varImp(rfImp)
            b         g
V3  31.770511 31.770511
V4  19.768070 19.768070
V5  37.919596 37.919596
V6  22.099063 22.099063
V7  22.786656 22.786656
V8  22.153388 22.153388
V9   9.596522  9.596522
V10  9.668101  9.668101
V11  5.363359  5.363359
V12 10.998718 10.998718
...

I think I am missing something...

EDIT 2

I figured out that if you take the mean of each row over the first two columns of importance(rfImp), you get the results of varImp(rfImp):

impRF <- importance(rfImp)[,1:2]
apply(impRF, 1, mean)
       V3        V4        V5        V6        V7        V8        V9 
31.770511 19.768070 37.919596 22.099063 22.786656 22.153388  9.596522 
      V10       V11       V12 
 9.668101  5.363359 10.998718     ...

# Same result as in both columns of varImp(rfImp)
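A direct check of this relationship (a quick sketch reusing the rfImp object fitted above):

vi <- varImp(rfImp)
all.equal(unname(rowMeans(importance(rfImp)[, 1:2])), vi[, 1])
# should print TRUE: varImp() is exactly the row mean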

I do not know why this is happening, but there has to be an explanation for that.

Upvotes: 19

Views: 29571

Answers (4)

https://www.r-bloggers.com/variable-importance-plot-and-variable-selection/

In the linked post it is shown that when you do not specify importance = TRUE in your model, you get the same mean decrease in Gini value from both the randomForest and caret packages.
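A minimal sketch of that check (using the mlbench Ionosphere data from the other answers; without importance = TRUE only the Gini column exists, so varImp falls back to it):

library(caret)
library(randomForest)
library(mlbench)
data(Ionosphere)

# No importance = TRUE: importance() returns only MeanDecreaseGini,
# and varImp() falls back to that single column (labelled Overall)
rf <- randomForest(Class ~ ., data = Ionosphere[, 3:35], ntree = 2000)
head(importance(rf))
head(varImp(rf))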

Upvotes: 0

Johannes

Reputation: 523

This answer is meant as an addition to the solution by @Shape. I think that importance follows the well-known permutation approach by Breiman to calculate the variable importance reported as MeanDecreaseAccuracy: for the out-of-bag sample of each tree, calculate the accuracy of the tree, then permute the variables one after the other and measure the accuracy after each permutation; the resulting decrease in accuracy is the importance of that variable.
I have not been able to find much information on how exactly the class-specific accuracy decreases in the first columns are computed, but I assume it is correctly predicted class k / total predicted class k.

As @Shape explains, varImp does not report the MeanDecreaseAccuracy computed by importance; instead it calculates the mean of the (scaled) class-specific decreases in accuracy and reports it for each of the classes. (For more than 2 classes, varImp only reports the class-specific decreases in accuracy.)
The two measures agree only if the class distribution is balanced, because only in the balanced case does a decrease in the accuracy of one class carry the same weight as a decrease in the accuracy of the other.

library(caret)
library(randomForest)
library(mlbench)

### Unequal sample size ###
data(Ionosphere)
rfImp1 <- randomForest(Class ~ ., data = Ionosphere[,3:35], ntree = 1000, importance = TRUE)

# How importance() calculates the overall decrease in accuracy for the variable
Imp1 <- importance(rfImp1, scale = FALSE)
classRatio1 <- summary(Ionosphere$Class)/nrow(Ionosphere)
classRatio1
#      bad      good 
#0.3589744 0.6410256 

# Caret calculates a simple mean
varImp(rfImp1, scale = FALSE)["V3",] # 0.04542253
Imp1["V3", "bad"] * 0.5 + Imp1["V3", "good"] * 0.5 # 0.04542253
# importance is closer to the weighted average of class importances
Imp1["V3", ] # 0.05262225  
Imp1["V3", "bad"] * classRatio1[1] + Imp1["V3", "good"] * classRatio1[2] # 0.05274091

### Equal sample size ###
Ionosphere2 <- Ionosphere[c(which(Ionosphere$Class == "good"),
                            sample(which(Ionosphere$Class == "bad"), 225, replace = TRUE)), ]
classRatio2 <- summary(Ionosphere2$Class)/nrow(Ionosphere2)
classRatio2
#  bad good 
# 0.5  0.5

rfImp2 <- randomForest(Class ~ ., data = Ionosphere2[,3:35], ntree = 1000, importance = TRUE)
Imp2 <- importance(rfImp2, scale = FALSE)

# Caret calculates a simple mean
varImp(rfImp2, scale = FALSE)["V3",] # 0.06126641 
Imp2["V3", "bad"] * 0.5 + Imp2["V3", "good"] * 0.5 # 0.06126641 
# As does the average adjusted for the balanced class ratio
Imp2["V3", "bad"] * classRatio2[1] + Imp2["V3", "good"] * classRatio2[2] # 0.06126641 
# With balanced classes there is now little difference between the measures
Imp2["V3", "MeanDecreaseAccuracy"] # 0.06106229

I believe this can be interpreted as caret putting equal weight on all classes, while importance reports variables as more important if they are important for the more common class. I tend to agree with Max Kuhn on this, but the difference should be explained somewhere in the documentation.
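To make the two weighting schemes explicit, here is a small sketch (the helper names are mine, not from either package), reusing Imp1 and classRatio1 from above:

# Equal-weight mean over the classes: what caret's varImp reports
equalWeight <- function(imp, classes) rowMeans(imp[, classes])

# Prevalence-weighted mean: approximately what MeanDecreaseAccuracy tracks
prevWeight <- function(imp, ratio) as.vector(imp[, names(ratio)] %*% ratio)

equalWeight(Imp1, c("bad", "good"))["V3"] # matches varImp()
prevWeight(Imp1, classRatio1)[1]          # close to MeanDecreaseAccuracy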

Upvotes: 8

Shape

Reputation: 2952

If we walk through the method for varImp:

Check the generic:

> getFromNamespace('varImp','caret')
function (object, ...) 
{
    UseMethod("varImp")
}

Get the S3 Method:

> getS3method('varImp','randomForest')
function (object, ...) 
{
    code <- varImpDependencies("rf")
    code$varImp(object, ...)
}
<environment: namespace:caret>


code <- caret:::varImpDependencies('rf')

> code$varImp
function(object, ...){
                    varImp <- randomForest::importance(object, ...)
                    if(object$type == "regression")
                      varImp <- data.frame(Overall = varImp[,"%IncMSE"])
                    else {
                      retainNames <- levels(object$y)
                      if(all(retainNames %in% colnames(varImp))) {
                        varImp <- varImp[, retainNames]
                      } else {
                        varImp <- data.frame(Overall = varImp[,1])
                      }
                    }

                    out <- as.data.frame(varImp)
                    if(dim(out)[2] == 2) {
                      tmp <- apply(out, 1, mean)
                      out[,1] <- out[,2] <- tmp  
                    }
                    out
                  }

So varImp is not strictly returning randomForest::importance.

It starts from that output, but then keeps only the columns corresponding to the class levels of the response.

Then it does something interesting: it checks whether we have only two columns:

if(dim(out)[2] == 2) {
   tmp <- apply(out, 1, mean)
   out[,1] <- out[,2] <- tmp  
}

According to the varImp man page:

Random Forest: varImp.randomForest and varImp.RandomForest are wrappers around the importance functions from the randomForest and party packages, respectively.

This is clearly not the case.


As to why...

If we have only two classes, the importance of the variable as a predictor can be represented as a single value.

If the variable is a predictor of g, then it must also be a predictor of b.

It does make sense, but this does not match the documentation of what the function does, so I would likely report this as unexpected behavior. The function is trying to assist when you might have expected to do the averaging yourself.
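If you want the per-class values without the averaging, one workaround (a sketch, not an official caret API) is to subset the importance matrix yourself:

# Recover the unaveraged class-specific importances directly
perClass <- randomForest::importance(rfImp)[, levels(rfImp$y)]
head(perClass)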

Upvotes: 19

geekoverdose

Reputation: 1007

I don't have your exact data, but using dummy data (see below) I cannot reproduce this behaviour. Maybe double-check that nothing else could have affected your results. Which versions of R and caret are you using?

library(caret)
library(randomForest)

# classification - same result
rfImp1 <- randomForest(Species ~ ., data = iris[,1:5],
                    ntree = 2000,
                    importance = TRUE)
importance(rfImp1)
varImp(rfImp1)

# regression - same result
rfImp2 <- randomForest(Sepal.Length ~ ., data = iris[,1:4],
                    ntree = 2000,
                    importance = TRUE)
importance(rfImp2)
varImp(rfImp2)
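(In hindsight, given the walkthrough in Shape's answer above, iris arguably cannot trigger the discrepancy: Species has three levels, so the two-column averaging branch in varImp never fires, and the regression case returns a single Overall column anyway. A quick check:)

# Species has 3 levels, so the two-column averaging branch never applies
nlevels(iris$Species) # 3
ncol(varImp(rfImp1))  # 3 class columns, taken directly from importance()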

Update:

Using the Ionosphere data this is reproducible:

library(caret)
library(randomForest)
library(mlbench)
data(Ionosphere)
str(Ionosphere)
rfImp1 <- randomForest(Class ~ ., data = Ionosphere[,3:35], ntree = 2000, importance = TRUE)

...with these results:

> head(importance(rfImp1))

         bad     good MeanDecreaseAccuracy MeanDecreaseGini
V3 20.545836 41.43872             41.26313        15.308791
V4 10.615291 29.31543             29.58395         6.226591
V5 29.508581 44.86784             46.79365        21.757928
V6  9.231544 31.77881             31.48614         7.201694
V7 12.461476 34.39334             34.92728        14.802564
V8 12.944721 32.49392             33.35699         6.971502

> head(varImp(rfImp1))

        bad     good
V3 30.99228 30.99228
V4 19.96536 19.96536
V5 37.18821 37.18821
V6 20.50518 20.50518
V7 23.42741 23.42741
V8 22.71932 22.71932

My guess would be that caret and randomForest just use different ways of aggregating the results of the individual runs for each variable, but @topepo will most likely give you an exact answer anyway now.

Upvotes: 2
