Kate

Reputation: 159

Conditional Variable Importance for Random Forests faster than in R?

I am working on a project to determine which variables best predict a binary outcome. I first fit a random forest and then calculate conditional variable importance to assess the importance of variables for my subgroup analysis. Training the random forest takes a few minutes with the R package party, while calculating conditional variable importance takes hours, if not days, for larger datasets.

To calculate conditional variable importance I used either

  1. party::varimp in R based on the paper https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-307 or
  2. permimp::permimp in R based on the later paper from the same authors https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03622-2
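For reference, a minimal sketch of the second option, assuming the permimp package is installed and accepts a party cforest fit as in its documented usage. The small ntree, the seed, the progressBar = FALSE argument, and the name cf_demo are illustrative choices, not part of the original setup; the `$values` element is assumed to hold the per-variable importances.

```r
library(party)
library(permimp)

set.seed(42)  # arbitrary seed, for reproducibility only

# Small forest so the sketch runs quickly (ntree = 50 is an assumption)
cf_demo <- cforest(Species ~ ., data = iris,
                   controls = cforest_unbiased(ntree = 50, mtry = 3))

# Conditional permutation importance via permimp
cpi <- permimp(cf_demo, conditional = TRUE, progressBar = FALSE)

# Per-variable importance values (assumed element name of the result)
cpi$values
```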

Question: since the implementation in R is so slow, are there any packages, say in Python, that are faster than R? Or is it in the nature of this algorithm that it cannot be implemented any faster?

UPD: I believe that the conditional option is the most time-consuming part, i.e. conditional = TRUE is much slower than conditional = FALSE in the code below.

# using cforest from the party package: 
library(party)  
# Fit the model 
cf <- cforest(Species ~ ., data = iris, 
              controls=cforest_unbiased(ntree=500, mtry=3))  
# Get variable importance 
varimp(cf, conditional = TRUE, nperm = 10)
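The difference between the two settings can be measured directly with system.time. This is only a rough sketch on iris with a smaller forest (ntree = 50 and the seed are arbitrary choices to keep it quick, and cf_small is a name introduced here); absolute timings depend on the machine and on the data.

```r
library(party)

set.seed(42)  # arbitrary seed, for reproducibility only

# Smaller forest than above so the comparison finishes quickly
cf_small <- cforest(Species ~ ., data = iris,
                    controls = cforest_unbiased(ntree = 50, mtry = 3))

# Time the unconditional and the conditional permutation importance
t_marginal    <- system.time(vi_marginal    <- varimp(cf_small, conditional = FALSE))
t_conditional <- system.time(vi_conditional <- varimp(cf_small, conditional = TRUE))

# Elapsed wall-clock seconds for each variant
c(marginal = t_marginal[["elapsed"]], conditional = t_conditional[["elapsed"]])
```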

Upvotes: 0

Views: 151

Answers (1)

rw2

Reputation: 1793

How long the analysis takes depends on the size of the data, the number of predictors, the number of trees, and so on. Some R packages are also much more efficient than others. I would recommend trying some alternative random forest functions first, such as ranger, cforest or Rborist. Here are some simple examples:

# using the ranger package:
library(ranger)

# Fit the model
rf <- ranger(Species ~ ., data = iris, importance = 'permutation')

# Get variable importance
rf$variable.importance


# using cforest from the party package:
library(party)

# Fit the model
cf <- cforest(Species ~ ., data = iris, controls=cforest_unbiased(ntree=500, mtry=3))

# Get variable importance (unconditional, since conditional = FALSE is the default)
varimp(cf)
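To see how the two packages compare on a given machine, the calls above can be wrapped in system.time. This is a rough sketch, not a benchmark: it times unconditional permutation importance only, the seed and num.trees = 500 are arbitrary choices, and iris is far too small to reflect behaviour on large datasets.

```r
library(ranger)
library(party)

set.seed(1)  # arbitrary seed, for reproducibility only

# ranger: fitting and permutation importance happen in one call
t_ranger <- system.time(
  rf <- ranger(Species ~ ., data = iris, num.trees = 500,
               importance = "permutation")
)

# party: fit the forest, then compute unconditional importance separately
t_party <- system.time({
  cf <- cforest(Species ~ ., data = iris,
                controls = cforest_unbiased(ntree = 500, mtry = 3))
  vi_cf <- varimp(cf)
})

# Elapsed wall-clock seconds for each package
c(ranger = t_ranger[["elapsed"]], party = t_party[["elapsed"]])
```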

Upvotes: 0
