mapleleaf

Reputation: 754

Variable Importance Tidymodels versus Caret with Interactions

Why are the variable importance plots different between tidymodels and caret when interaction terms are included? I demonstrate this with the Ames housing data below. I used the same alpha/mixture and lambda/penalty in both models. The only difference between the models is the cross-validation folds (I cannot figure out how to use tidymodels' folds with caret's train). Any ideas on why this is happening?

library(AmesHousing)
library(tidymodels)
library(caret)
library(vip)

df <- data.frame(ames_raw)
head(df)



# replace missing values in numeric columns with the column mean
for (i in seq_len(ncol(df))) {
  if (is.numeric(df[[i]])) {
    df[is.na(df[[i]]), i] <- mean(df[[i]], na.rm = TRUE)
  }
}



# Create a data split object
set.seed(1994)
home_split <- initial_split(df,
                            prop = 0.7,
                            strata = SalePrice)

home_train <- home_split %>%
  training()

home_test <- home_split %>%
  testing()


# pre-process recipe
recipe_home <- recipe(SalePrice ~ Yr.Sold + Fireplaces + Full.Bath + Half.Bath + Year.Built + Lot.Area,
                      data = home_train) %>%
  step_interact(terms = ~ Yr.Sold:Fireplaces:Full.Bath:Half.Bath:Year.Built:Lot.Area)



# model with hyperparameters
glmnet_model <- linear_reg(penalty = tune(), # lambda
                           mixture = tune()) %>% # alpha
  set_engine('glmnet') %>%
  set_mode('regression')

# model + recipe = workflow
wkfl <- workflow() %>%
  add_model(glmnet_model) %>%
  add_recipe(recipe_home)



# cv
set.seed(1994)
myfolds <- vfold_cv(home_train,
                    v = 10,
                    strata = SalePrice)

# grid search with cv
set.seed(1994)
glmnet_tuning <- wkfl %>%
  tune_grid(resamples = myfolds,
            grid = 25, # let the model find the best hyperparameters
            metrics = metric_set(rmse))

glmnet_tuning





# select the best model
best_glmnet_model <- glmnet_tuning %>%
  select_best(metric = 'rmse')

best_glmnet_model

# finalize the workflow
final_glmnet_wkfl <- wkfl %>%
  finalize_workflow(best_glmnet_model)

# last_fit:
glmnet_final_fit <- final_glmnet_wkfl %>%
  last_fit(split = home_split)

# extract the final model
final_glmnet <- extract_workflow(glmnet_final_fit)

# VIP final model
final_glmnet %>%
  extract_fit_parsnip() %>%
  vip(geom = "point", scale = TRUE)

[Image: variable importance plot for the tidymodels model]

set.seed(1994)

myGrid <- expand.grid(lambda = 0.00386,
                      alpha = 0.0874)

model_glmnet <- train(SalePrice ~ (Yr.Sold + Fireplaces + Full.Bath + Half.Bath + Year.Built
                                   + Lot.Area)^2,
                      data = home_train,
                      method = "glmnet",
                      tuneGrid = myGrid, # caret::train() expects tuneGrid, not tune_grid
                      metric = "RMSE",
                      maximize = FALSE,
                      trControl = trainControl(method = "cv",
                                               number = 10))

# variable importance
vip(model_glmnet, geom = "point", scale = TRUE)

[Image: variable importance plot for the caret model]

Upvotes: 0

Views: 443

Answers (1)

Mark Rieke

Reputation: 350

It looks like the two model specifications have very different features, which is why you're seeing different importance plots.

In your example, recipe_home has a single interaction term that multiplies six variables together: Yr.Sold:Fireplaces:Full.Bath:Half.Bath:Year.Built:Lot.Area.

recipe_home <- 
  recipe(SalePrice ~ Yr.Sold + Fireplaces + Full.Bath + Half.Bath + Year.Built + Lot.Area,
         data = home_train) %>%
  step_interact(terms = ~ Yr.Sold:Fireplaces:Full.Bath:Half.Bath:Year.Built:Lot.Area)

In your caret model (method = "glmnet"), you're creating a whole set of two-way interactions by using ^2 in your formula. In R's formula syntax, (a + b + c)^2 crosses the terms to second order: it expands to the main effects plus every pairwise interaction.

model_glmnet <- 
  train(SalePrice ~ (Yr.Sold + Fireplaces + Full.Bath + Half.Bath + Year.Built + Lot.Area)^2,
        data = home_train,
        method = "glmnet",
        tuneGrid = myGrid, # caret::train() expects tuneGrid, not tune_grid
        metric = "RMSE",
        maximize = FALSE,
        trControl = trainControl(method = "cv", number = 10))

So in the second plot, you have a number of important features created by the ^2 crossing (e.g., Fireplaces:Full.Bath) that don't appear at all in the recipe_home model.
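To see the mismatch without fitting anything, here is a minimal base-R sketch that expands both formulas with terms() and compares the resulting term labels. The names f_squared and f_single are made up for illustration; f_single is my formula-style rendering of what the recipe builds, not code from the question.

```r
# Formula from the caret call: main effects crossed to second order
f_squared <- SalePrice ~ (Yr.Sold + Fireplaces + Full.Bath + Half.Bath +
                            Year.Built + Lot.Area)^2

# Formula-style equivalent of the recipe: main effects plus ONE six-way term
f_single <- SalePrice ~ Yr.Sold + Fireplaces + Full.Bath + Half.Bath +
  Year.Built + Lot.Area +
  Yr.Sold:Fireplaces:Full.Bath:Half.Bath:Year.Built:Lot.Area

# terms() expands a formula symbolically; no data is needed
labels_squared <- attr(terms(f_squared), "term.labels")
labels_single  <- attr(terms(f_single), "term.labels")

length(labels_squared) # 6 main effects + choose(6, 2) = 15 pairwise terms
length(labels_single)  # 6 main effects + 1 six-way term

# Terms the caret model has that the recipe model does not
setdiff(labels_squared, labels_single)
```

The setdiff() result lists the fifteen pairwise terms, including Fireplaces:Full.Bath, which only the ^2 model contains.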

Depending on which interaction terms you want, you should be able to make the models match in one of two ways: either change recipe_home so that step_interact() uses the ^2 crossing instead of the single long term, or change the formula in the caret model by removing the ^2 and adding that long interaction term.
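As a rough sketch of the first option, the recipe below puts the ^2 crossing inside step_interact(). It assumes step_interact() expands ^2 the way a model formula does, which is worth checking against your installed version of {recipes}. The data frame toy_train is synthetic (made-up values with the question's column names) so the snippet runs on its own; with the real data you would pass home_train instead.

```r
library(recipes)

# Synthetic stand-in for home_train: made-up values, column names from the question
set.seed(1)
n <- 50
toy_train <- data.frame(
  SalePrice  = rnorm(n, 180000, 50000),
  Yr.Sold    = sample(2006:2010, n, replace = TRUE),
  Fireplaces = sample(0:2, n, replace = TRUE),
  Full.Bath  = sample(1:3, n, replace = TRUE),
  Half.Bath  = sample(0:1, n, replace = TRUE),
  Year.Built = sample(1950:2005, n, replace = TRUE),
  Lot.Area   = rnorm(n, 10000, 2000)
)

recipe_pairwise <- recipe(
  SalePrice ~ Yr.Sold + Fireplaces + Full.Bath + Half.Bath + Year.Built + Lot.Area,
  data = toy_train
)

# Assumption: ^2 inside `terms` yields every pairwise interaction
recipe_pairwise <- step_interact(
  recipe_pairwise,
  terms = ~ (Yr.Sold + Fireplaces + Full.Bath + Half.Bath + Year.Built + Lot.Area)^2
)

# Estimate the recipe and return the processed training data
baked <- bake(prep(recipe_pairwise), new_data = NULL)

# Columns added by the step, i.e. the interaction features
interaction_cols <- setdiff(names(baked), names(toy_train))
interaction_cols
```

If the assumption holds, interaction_cols should contain one column per pair of predictors, which is the feature set the caret formula with ^2 produces.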

Upvotes: 2
