Why are the variable importance plots different between tidymodels and caret when including interaction terms? I have demonstrated this with the Ames housing data below. I used the same alpha/mixture and lambda/penalty in both models. The only difference between the models is the cross-validation folds (I cannot figure out how to use tidymodels' folds with caret's train). Any ideas on why this is happening?
library(AmesHousing)
library(tidymodels)
library(caret)
library(vip)
df <- data.frame(ames_raw)
head(df)
# replace any missing observation with the mean
for (i in seq_len(ncol(df))) {
  # only numeric columns have a meaningful mean
  if (is.numeric(df[[i]])) {
    df[is.na(df[, i]), i] <- mean(df[, i], na.rm = TRUE)
  }
}
# Create a data split object
set.seed(1994)
home_split <- initial_split(df,
                            prop = 0.7,
                            strata = SalePrice)
home_train <- home_split %>%
  training()
home_test <- home_split %>%
  testing()
# pre-process recipe
recipe_home <- recipe(SalePrice ~ Yr.Sold + Fireplaces + Full.Bath + Half.Bath + Year.Built + Lot.Area,
                      data = home_train) %>%
  step_interact(terms = ~ Yr.Sold:Fireplaces:Full.Bath:Half.Bath:Year.Built:Lot.Area)
# model with hyperparameters
glmnet_model <- linear_reg(penalty = tune(), # lambda
                           mixture = tune()) %>% # alpha
  set_engine('glmnet') %>%
  set_mode('regression')
# model + recipe = workflow
wkfl <- workflow() %>%
  add_model(glmnet_model) %>%
  add_recipe(recipe_home)
# cv
set.seed(1994)
myfolds <- vfold_cv(home_train,
                    v = 10,
                    strata = SalePrice)
# grid search with cv
set.seed(1994)
glmnet_tuning <- wkfl %>%
  tune_grid(resamples = myfolds,
            grid = 25, # let the model find the best hyperparameters
            metrics = metric_set(rmse))
glmnet_tuning
# select the best model
best_glmnet_model <- glmnet_tuning %>%
  select_best(metric = 'rmse')
best_glmnet_model
# finalize the workflow
final_glmnet_wkfl <- wkfl %>%
  finalize_workflow(best_glmnet_model)
# last_fit:
glmnet_final_fit <- final_glmnet_wkfl %>%
  last_fit(split = home_split)
# extract the final model
final_glmnet <- extract_workflow(glmnet_final_fit)
# VIP final model
final_glmnet %>%
  extract_fit_parsnip() %>%
  vip(geom = "point", scale = TRUE)
# caret model with the hyperparameters selected above
set.seed(1994)
myGrid <- expand.grid(lambda = 0.00386,
                      alpha = 0.0874)
model_glmnet <- train(SalePrice ~ (Yr.Sold + Fireplaces + Full.Bath + Half.Bath + Year.Built
                                   + Lot.Area)^2,
                      data = home_train,
                      method = "glmnet",
                      tuneGrid = myGrid, # caret's argument is tuneGrid, not tune_grid
                      metric = "RMSE",
                      maximize = FALSE,
                      trControl = trainControl(
                        method = "cv",
                        number = 10))
# variable importance
vip(model_glmnet, geom = "point", scale = TRUE)
Upvotes: 0
It looks like the two model specifications have very different features, which is why you're seeing different importance plots.
In your example, recipe_home has a single interaction term that multiplies all six variables together, Yr.Sold:Fireplaces:Full.Bath:Half.Bath:Year.Built:Lot.Area, and no other interactions.
recipe_home <-
  recipe(SalePrice ~ Yr.Sold + Fireplaces + Full.Bath + Half.Bath + Year.Built + Lot.Area,
         data = home_train) %>%
  step_interact(terms = ~ Yr.Sold:Fireplaces:Full.Bath:Half.Bath:Year.Built:Lot.Area)
In your caret model (method = "glmnet"), the ^2 in the formula crosses the variables: it expands to all six main effects plus every pairwise interaction between them, which gives 15 two-way terms.
model_glmnet <-
  train(SalePrice ~ (Yr.Sold + Fireplaces + Full.Bath + Half.Bath + Year.Built + Lot.Area)^2,
        data = home_train,
        method = "glmnet",
        tuneGrid = myGrid, # caret's argument is tuneGrid, not tune_grid
        metric = "RMSE",
        maximize = FALSE,
        trControl = trainControl(method = "cv", number = 10))
So in the second plot, you have a bunch of important features created by the ^2 interaction (for example, Fireplaces:Full.Bath) that don't appear at all in the recipe_home model.
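To see the difference in the generated features without fitting anything, here is a minimal base R sketch. It uses a made-up data frame (toy, with placeholder columns a, b and c, not the Ames variables) and model.matrix() to list the columns each formula shape produces. It only illustrates how the two formula styles expand; it is not the exact set of columns that recipes or caret build internally.

```r
# Toy data frame; the column names are placeholders, not the Ames variables.
toy <- data.frame(a = c(1, 2, 3, 4), b = c(2, 1, 4, 3), c = c(5, 7, 6, 8))

# One product term across all variables (the shape used in step_interact()).
single_term <- colnames(model.matrix(~ a:b:c, data = toy))

# Main effects plus every pairwise interaction (the shape used in the caret formula).
pairwise_terms <- colnames(model.matrix(~ (a + b + c)^2, data = toy))

print(single_term)
print(pairwise_terms)
```

The first formula yields only an intercept and the single a:b:c column, while the second yields the intercept, the three main effects and the three pairwise products, so the two models are ranking different sets of features.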
Depending on which interaction terms you want, you should be able to make the models match up in one of two ways: either change recipe_home to remove the six-way term in step_interact() and use the ^2 pairwise interactions instead, or change the formula in the caret model by removing the ^2 and adding that long interaction term.
Upvotes: 2