S Front

Reputation: 353

DALEX and step_pca

I would like to look at the compound feature importance of the principal components with DALEX model_parts, but I am also interested in the extent to which the results are driven by variation in a specific variable within a principal component. I can look at individual feature influence very neatly when using model_profile, but in that case I cannot investigate the feature importance of the PCA variables. Is it possible to get the best of both worlds and look at the compound feature importance of a principal component while still using model_profile partial dependence plots of individual factors, as shown below?

Data:

library(tidymodels)
library(parsnip)
library(DALEXtra)

set.seed(1)
x1 <- rbinom(1000, 5, .1)
x2 <- rbinom(1000, 5, .4)
x3 <- rbinom(1000, 5, .9)
x4 <- rbinom(1000, 5, .6)
id <- c(1:1000)
y <- as.factor(rbinom(1000, 5, .5))
df <- tibble(y, x1, x2, x3, x4, id)
df[, c("x1", "x2", "x3", "x4", "id")] <- sapply(df[, c("x1", "x2", "x3", "x4", "id")], as.numeric)

Model

# create training and test set
set.seed(20)
split_dat <- initial_split(df, prop = 0.8)
train <- training(split_dat)
test <- testing(split_dat)
# use cross-validation
kfolds <- vfold_cv(df)

# recipe
rec_pca <- recipe(y ~ ., data = train) %>%
  update_role(id, new_role = "id variable") %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_pca(x1, x2, x3, threshold = 0.9)

# parsnip engine
boost_model <- boost_tree() %>% 
  set_mode("classification") %>% 
  set_engine("xgboost")

# create wf
boosted_wf <- 
  workflow() %>% 
  add_model(boost_model) %>% 
  add_recipe(rec_pca)

final_boosted <- generics::fit(boosted_wf, df) 

# create an explanation object
explainer_xgb <- DALEXtra::explain_tidymodels(final_boosted, 
                                              data = df[,-1], 
                                              y = df$y) 

# feature importance
model_parts(explainer_xgb) %>% plot()

This gives me the following plot, even though I have reduced x1, x2 and x3 into one component with step_pca above.

[plot: model_parts feature importance, with x1, x2, x3 and x4 shown as separate variables]

I know I could reduce the dimensions manually, bind the result to the data like so, and then look at the feature importance.

rec_pca_2 <- df %>% 
  select(x1, x2, x3) %>% 
  recipe() %>%
  step_pca(all_numeric(), num_comp = 1)


df <- bind_cols(df, prep(rec_pca_2) %>% juice())
df

> df
# A tibble: 1,000 × 6
   y        x1    x2    x3    x4   PC1
   <fct> <int> <int> <int> <int> <dbl>
 1 2         0     2     4     2 -4.45
 2 3         0     3     3     3 -3.95
 3 0         0     2     4     4 -4.45
 4 2         1     4     5     3 -6.27
 5 4         0     1     5     2 -4.94
 6 2         1     0     5     1 -4.63
 7 3         2     2     5     4 -5.56
 8 3         1     2     5     3 -5.45
 9 2         1     3     5     2 -5.86
10 2         0     2     5     1 -5.35
# … with 990 more rows

I could then estimate a model with PC1 as a covariate. Yet, in that case, it would be difficult to interpret what variation in PC1 substantively means when using model_profile, since everything would be collapsed into one component.

model_profile(explainer_xgb) %>% plot()

[plot: model_profile partial dependence profiles for the individual predictors]

Thus, my key question is: how can I look at the feature importance of components without compromising on the interpretability of the partial dependence plot?

Upvotes: 1

Views: 202

Answers (1)

Julia Silge

Reputation: 11623

You may be interested in the discussion here on how to get explainability from the original predictors vs. features that have been created via feature engineering (like PCA components). We don't have a super fluent interface for this yet, so you have to do it a bit manually:

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(parsnip)
library(DALEX)
#> Welcome to DALEX (version: 2.4.0).
#> Find examples and detailed introduction at: http://ema.drwhy.ai/
#> 
#> Attaching package: 'DALEX'
#> The following object is masked from 'package:dplyr':
#> 
#>     explain

set.seed(1)
x1 <- rbinom(1000, 5, .1)
x2 <- rbinom(1000, 5, .4)
x3 <- rbinom(1000, 5, .9)
x4 <- rbinom(1000, 5, .6)
y <- as.factor(sample(c("yes", "no"), size = 1000, replace = TRUE))
df <- tibble(y, x1, x2, x3, x4) %>% mutate(across(where(is.integer), as.numeric))

# create training and test set
set.seed(20)
split_dat <- initial_split(df, prop = 0.8)
train <- training(split_dat)
test <- testing(split_dat)
# use cross-validation
kfolds <- vfold_cv(df)

# recipe
rec_pca <- recipe(y ~ ., data = train) %>%
    step_center(all_predictors()) %>%
    step_scale(all_predictors()) %>%
    step_pca(x1, x2, x3, threshold = 0.9)

# parsnip engine
boost_model <- boost_tree() %>% 
    set_mode("classification") %>% 
    set_engine("xgboost")

# create wf
boosted_wf <- 
    workflow() %>% 
    add_model(boost_model) %>% 
    add_recipe(rec_pca)

final_boosted <- generics::fit(boosted_wf, df) 
#> [14:00:11] WARNING: amalgamation/../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.

Notice that here I use regular DALEX (not DALEXtra), that I manually extract the xgboost model from inside the workflow, and that I apply the feature engineering to the data myself:

# create an explanation object
explainer_xgb <-
    DALEX::explain(
        extract_fit_parsnip(final_boosted), 
        data = rec_pca %>% prep() %>% bake(new_data = NULL, all_predictors()), 
        y = as.integer(train$y)
    ) 
#> Preparation of a new explainer is initiated
#>   -> model label       :  model_fit  (  default  )
#>   -> data              :  800  rows  4  cols 
#>   -> data              :  tibble converted into a data.frame 
#>   -> target variable   :  800  values 
#>   -> predict function  :  yhat.model_fit  will be used (  default  )
#>   -> predicted values  :  No value for predict function target column. (  default  )
#>   -> model_info        :  package parsnip , ver. 0.1.7 , task classification (  default  ) 
#>   -> predicted values  :  numerical, min =  0.1157353 , mean =  0.4626758 , max =  0.8343955  
#>   -> residual function :  difference between y and yhat (  default  )
#>   -> residuals         :  numerical, min =  0.1860582 , mean =  0.9985742 , max =  1.884265  
#>   A new explainer has been created!


model_parts(explainer_xgb) %>% plot()

Created on 2022-03-11 by the reprex package (v2.0.1)
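If you also want the partial dependence side of your question, this same explainer should work with model_profile(), because its data now contains the PCA component rather than the original x1, x2 and x3. A minimal sketch (not run above; it assumes the recipe names the component PC1, which is the step_pca default):

# partial dependence on the engineered features instead of the original predictors
model_profile(explainer_xgb, variables = c("PC1", "x4")) %>% plot()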

The only behavior supported right now in DALEXtra is based on using the original predictors, so if you want to look at those engineered features, you need to do it yourself. You may be interested in this chapter of our book.
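As a side note: if what you mainly want is a single, compound importance score for x1, x2 and x3, DALEX's model_parts() also accepts a variable_groups argument (passed on to feature_importance()), which permutes a set of columns together. A rough sketch on the original-predictor explainer from your question (treat it as an assumption about your setup rather than a tested snippet):

# grouped permutation importance: x1, x2 and x3 are permuted together as one group
model_parts(
  explainer_xgb,
  variable_groups = list(pca_vars = c("x1", "x2", "x3"), x4 = "x4")
) %>% plot()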

Upvotes: 1
