How to make predictions in tidymodels R when feature selection has been applied to the model

Question

I have a training dataset and a test dataset, and I am building an SVM on the training data with the tidymodels package in R. As part of the SVM workflow, I am doing feature selection to choose the 5 best-performing features. I then want to evaluate this SVM on the test dataset. However, I get a "The following required columns are missing" error when I try to predict classes for the test dataset, even though the variables in the test dataset match the model predictors.

Note that I do the feature selection using step_select_roc, with top_p = 5 selecting the 5 best-performing features. Here is a reproducible example:

library(tidymodels)
# remotes::install_github("stevenpawley/recipeselectors")
library(recipeselectors)

library(mlbench)
data(Ionosphere)

# preprocess dataset
Ionosphere <- Ionosphere %>% select(-V1, -V2)

# split into training and test data
ion_split <- initial_split(Ionosphere, prop = 3/5)

ion_train <- training(ion_split)
ion_test <- testing(ion_split) 

# make a recipe - note the step_select_roc function, which will select the 5 best-performing features
iono_rec <-
  recipe(Class ~ ., data = ion_train)  %>%
  step_zv(all_predictors()) %>% 
  step_lincomb(all_numeric()) %>%
  step_select_roc(all_predictors(), outcome = "Class", top_p = 5)

# build the model and workflow
svm_mod <-
  svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
  set_mode("classification") %>%
  set_engine("kernlab")

svm_workflow <-
  workflow() %>%
  add_recipe(iono_rec) %>%
  add_model(svm_mod)

# run model tuning
set.seed(35)
recipe_res <-
  svm_workflow %>% 
  tune_grid(
    resamples = bootstraps(ion_train, times = 2),
    metrics = metric_set(roc_auc),
    control = control_grid(verbose = TRUE, save_pred = TRUE)
  )

# choose best model, finalise workflow
best_mod <- recipe_res %>% select_best("roc_auc")
final_wf <- finalize_workflow(svm_workflow, best_mod)
final_mod <- final_wf %>% fit(ion_train)

At this stage, I can use pull_workflow_mold to confirm that only 5 predictor variables remain:

pull_workflow_mold(final_mod)$predictor
# A tibble: 211 x 5
        V3     V7    V27     V31     V33
     <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
 1  0.995   0.834  0.411  0.423   0.186 
 2  1      -0.109 -0.205 -0.166  -0.137 
 3  1       1      0.590  0.604   0.560 
 4  0.976   0.928  0.137 -0.0426 -0.138 
 5  0.964   1      0.576  0.451   0.389 
 6 -0.0186  0      0.206  0.166  -0.0821
 7  1       1      1      1       1     
 8  1       1.00   0.762  0.687   0.647 
 9  1       0.855  1      1       1     
10  1       1      1      1       1     
# … with 201 more rows

Now, if I subset my test data to only the predictors in the model and then call predict, I get an error:

ion_test <- testing(ion_split) %>% select(V3, V7, V27, V31, V33)

predict_res <- predict(
  final_mod,
  ion_test,
  type = "prob"
)
    
Error: The following required columns are missing: 'V4', 'V5', 'V6', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V28', 'V29', 'V30', 'V32', 'V34'.

Can someone please advise why this problem is happening, and how to avoid it? Thank you.

Answers (1)
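
A likely explanation, offered as a sketch rather than a definitive diagnosis: step_select_roc is part of the recipe inside the workflow, so the fitted workflow expects new data to contain every predictor column it saw at fit time and then applies the feature selection itself. Under that assumption, passing the full, unsubsetted test set to predict should avoid the error. The code below is a shortened version of the question's example. The fixed cost and rbf_sigma values are placeholders chosen only to skip tuning, and extract_mold() is the newer name for pull_workflow_mold() in recent workflows releases.

```r
library(tidymodels)
library(recipeselectors)  # provides step_select_roc(), as in the question
library(mlbench)
data(Ionosphere)

set.seed(35)
ion_split <- initial_split(Ionosphere %>% select(-V1, -V2), prop = 3/5)
ion_train <- training(ion_split)
ion_test  <- testing(ion_split)  # keep every original column, do not subset

# Same recipe as in the question: the selection step lives inside the recipe
iono_rec <-
  recipe(Class ~ ., data = ion_train) %>%
  step_zv(all_predictors()) %>%
  step_lincomb(all_numeric()) %>%
  step_select_roc(all_predictors(), outcome = "Class", top_p = 5)

# Placeholder hyperparameters (an assumption) instead of tune()
svm_mod <-
  svm_rbf(cost = 1, rbf_sigma = 0.1) %>%
  set_mode("classification") %>%
  set_engine("kernlab")

svm_fit <-
  workflow() %>%
  add_recipe(iono_rec) %>%
  add_model(svm_mod) %>%
  fit(data = ion_train)

# Predictors that survived the selection step
selected <- names(extract_mold(svm_fit)$predictors)

# Predict on the full test set; the workflow bakes the recipe,
# including the feature selection, before calling the model
pred <- predict(svm_fit, ion_test, type = "prob")
```

The same call should work with the tuned final_mod from the question, that is, predict(final_mod, testing(ion_split), type = "prob"). If only the five selected columns will be available at prediction time, one option is to fit a separate workflow whose recipe starts from those columns only.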
