Reputation: 11
I have built a random forest model with tidymodels, very similar to what Julia Silge has done in this video. I also plan to show variable importance plots based on the permutation method; however, I would like to show box plots or violin plots rather than points.
Here is an example, following Julia's code:
Data and Model Building
# DATA
library(tidyverse)
water_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-05-04/water.csv")
# Data prep
water <- water_raw %>%
  filter(
    country_name == "Sierra Leone",
    lat_deg > 0, lat_deg < 15, lon_deg < 0,
    status_id %in% c("y", "n")
  ) %>%
  mutate(pay = case_when(
    str_detect(pay, "^No") ~ "no",
    str_detect(pay, "^Yes") ~ "yes",
    is.na(pay) ~ pay,
    TRUE ~ "it's complicated"
  )) %>%
  select(-country_name, -status, -report_date) %>%
  mutate_if(is.character, as.factor)
library(tidymodels)
set.seed(123)
water_split <- initial_split(water, strata = status_id)
water_train <- training(water_split)
water_test <- testing(water_split)
set.seed(234)
water_folds <- vfold_cv(water_train, strata = status_id)
water_folds
# Model building
library(themis)
ranger_recipe <-
  recipe(formula = status_id ~ ., data = water_train) %>%
  update_role(row_id, new_role = "id") %>%
  step_unknown(all_nominal_predictors()) %>%
  step_other(all_nominal_predictors(), threshold = 0.03) %>%
  step_impute_linear(install_year) %>%
  step_downsample(status_id)
ranger_spec <-
  rand_forest(trees = 1000) %>%
  set_mode("classification") %>%
  set_engine("ranger")
ranger_workflow <-
  workflow() %>%
  add_recipe(ranger_recipe) %>%
  add_model(ranger_spec)
doParallel::registerDoParallel()
set.seed(74403)
ranger_rs <-
  fit_resamples(ranger_workflow,
    resamples = water_folds,
    control = control_resamples(save_pred = TRUE)
  )
Here is Julia's VIP code:
library(vip)
imp_data <- ranger_recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%
  select(-row_id)
ranger_spec %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(status_id ~ ., data = imp_data) %>%
  vip(geom = "point")
My Attempt:
ranger_spec %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(status_id ~ ., data = imp_data) %>%
  vip(pred_wrapper = predict, geom = "boxplot", nsim = 10, keep = TRUE)
However it continues to return this error:
Error: To construct boxplots for permutation-based importance scores you must specify keep = TRUE in the call vi() or vi_permute(). Additionally, you also need to set nsim >= 2.
Because I have done all of those things, I assume my error is with pred_wrapper, but I'm not sure. What am I doing wrong here?
Thanks, y'all!
Upvotes: 1
Views: 442
Reputation: 11613
First, you may be interested in a resampling approach to estimating variable importance, where you yourself control the resampling and what gets extracted.
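A minimal sketch of that resampling idea, assuming a small stand-in data set (`two_class_dat` from modeldata) instead of the water data. The helper name `get_imp` is made up for illustration. Each fold's fitted ranger model reports its own permutation importance, and the spread across folds is what ends up in the box plot:

```r
library(tidymodels)
library(vip)

set.seed(123)
# Stand-in two-class data; for the water model, use water_folds and
# ranger_workflow (with importance = "permutation" set on the engine)
data(two_class_dat, package = "modeldata")
folds <- vfold_cv(two_class_dat, v = 5, strata = Class)

rf_spec <- rand_forest(trees = 200) %>%
  set_mode("classification") %>%
  set_engine("ranger", importance = "permutation")

rf_wf <- workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(rf_spec)

# Hypothetical helper: pull importance scores out of each fold's fitted model
get_imp <- function(x) {
  x %>%
    extract_fit_parsnip() %>%
    vi()
}

rf_rs <- fit_resamples(
  rf_wf,
  resamples = folds,
  control = control_resamples(extract = get_imp)
)

# One row per fold and variable
imp_by_fold <- rf_rs %>%
  select(id, .extracts) %>%
  unnest(.extracts) %>%
  unnest(.extracts)

p <- ggplot(imp_by_fold, aes(Importance, reorder(Variable, Importance))) +
  geom_boxplot() +
  labs(y = NULL)
p
```

Swapping `geom_boxplot()` for `geom_violin()` should give the violin version, though with only a handful of folds a box plot is probably easier to read.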
Second, I think something is not working quite right with method = "permute" for a tidymodels model. I can't get it to work, but I can get the permutation importance for the underlying ranger model:
library(vip)
imp_data <- ranger_recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%
  select(-row_id)
# Fit ranger directly rather than through parsnip
mod <- ranger::ranger(status_id ~ ., data = imp_data, classification = TRUE)
# vip needs a wrapper that returns the predicted classes as a vector
pred_fun <- function(object, newdata) {
  predict(object, newdata)$predictions
}
vip(mod, method = "permute",
    train = imp_data, target = "status_id",
    metric = "accuracy", pred_wrapper = pred_fun)
Created on 2022-09-02 with reprex v2.0.2
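To get box plots from the call above, my understanding is that the same `vip()` call needs `nsim`, `keep = TRUE`, and `geom = "boxplot"` on top of `method = "permute"`. The attempt in the question never sets `method = "permute"`, which is probably why vip falls back to the model's built-in scores and finds no repeated permutations to plot. Here is a rough sketch, assuming `iris` as stand-in data so it runs on its own; swap in `imp_data` and `"status_id"` for the water model:

```r
library(ranger)
library(vip)

set.seed(2022)
# Stand-in data and model; not the water data from the question
mod <- ranger(Species ~ ., data = iris)

# Return predicted class labels, which is what the accuracy metric compares
pred_fun <- function(object, newdata) {
  predict(object, data = newdata)$predictions
}

p <- vip(
  mod,
  method = "permute",      # permutation importance computed by vip itself
  train = iris,
  target = "Species",
  metric = "accuracy",
  pred_wrapper = pred_fun,
  nsim = 10,               # repeat the permutation 10 times per feature
  keep = TRUE,             # keep the raw scores from every repetition
  geom = "boxplot"
)
p
```

Each box then shows the spread of the 10 permutation scores for one feature.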
Here is another resource for how to use vip, but you may want to look into using DALEX for permutation variable importance.
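If you go the DALEX route, a sketch along these lines may be a starting point. It assumes a two-class subset of `iris` as stand-in data, a probability forest, and a hand-written predict function; none of that comes from the water example, so treat the names as placeholders:

```r
library(DALEX)
library(ranger)

set.seed(1)
# Stand-in two-class data; not the water data from the question
dat <- iris[iris$Species != "setosa", ]
dat$Species <- droplevels(dat$Species)

# Probability forest so the explainer can work with class probabilities
mod <- ranger(Species ~ ., data = dat, probability = TRUE)

explainer <- DALEX::explain(
  mod,
  data = dat[, names(dat) != "Species"],
  y = as.numeric(dat$Species == "virginica"),
  # Return the predicted probability of the "virginica" class
  predict_function = function(m, newdata) {
    predict(m, data = newdata)$predictions[, "virginica"]
  },
  label = "ranger",
  verbose = FALSE
)

# B = 10 permutation rounds per variable; loss is 1 - AUC
perm <- model_parts(explainer, loss_function = loss_one_minus_auc, B = 10)
plot(perm)
```

By default the plot overlays box plots of the per-permutation losses, which is close to what the question asks for.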
Upvotes: 2