I have two datasets, a training set and a test set, and I am building an SVM on the training set with the tidymodels package in R. As part of the SVM workflow, I use feature selection to choose the 5 best-performing features. I then try to test this SVM on the test set. However, I get a "The following required columns are missing" error when I try to predict classes for the test set, even though the variables in the test set match the model predictors.
Note that I do the feature selection using step_select_roc, with top_p = 5 selecting the 5 best-performing features. I have created a reproducible example:
library(tidymodels)
#remotes::install_github("stevenpawley/recipeselectors")
library(recipeselectors)
library(mlbench)
data(Ionosphere)
# preprocess dataset
Ionosphere <- Ionosphere %>% select(-V1, -V2)
# split into training and test data
ion_split <- initial_split(Ionosphere, prop = 3/5)
ion_train <- training(ion_split)
ion_test <- testing(ion_split)
# make a recipe - note the step_select_roc function, which will select the 5 best-performing features
iono_rec <-
  recipe(Class ~ ., data = ion_train) %>%
  step_zv(all_predictors()) %>%
  step_lincomb(all_numeric()) %>%
  step_select_roc(all_predictors(), outcome = "Class", top_p = 5)
# build the model and workflow
svm_mod <-
  svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
  set_mode("classification") %>%
  set_engine("kernlab")
svm_workflow <-
  workflow() %>%
  add_recipe(iono_rec) %>%
  add_model(svm_mod)
# run model tuning
set.seed(35)
recipe_res <-
  svm_workflow %>%
  tune_grid(
    resamples = bootstraps(ion_train, times = 2),
    metrics = metric_set(roc_auc),
    control = control_grid(verbose = TRUE, save_pred = TRUE)
  )
# choose best model, finalise workflow
best_mod <- recipe_res %>% select_best(metric = "roc_auc")
final_wf <- finalize_workflow(svm_workflow, best_mod)
final_mod <- final_wf %>% fit(ion_train)
At this stage, I can use pull_workflow_mold to see that there are only 5 predictor variables:
pull_workflow_mold(final_mod)$predictor
# A tibble: 211 x 5
V3 V7 V27 V31 V33
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.995 0.834 0.411 0.423 0.186
2 1 -0.109 -0.205 -0.166 -0.137
3 1 1 0.590 0.604 0.560
4 0.976 0.928 0.137 -0.0426 -0.138
5 0.964 1 0.576 0.451 0.389
6 -0.0186 0 0.206 0.166 -0.0821
7 1 1 1 1 1
8 1 1.00 0.762 0.687 0.647
9 1 0.855 1 1 1
10 1 1 1 1 1
# … with 201 more rows
Now if I subset my test data to only those predictors in the model, and then try and use predict, I get an error:
ion_test <- testing(ion_split) %>% select(V3, V7, V27, V31, V33)
predict_res <- predict(
final_mod,
ion_test,
type = "prob")
Error: The following required columns are missing: 'V4', 'V5', 'V6', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V28', 'V29', 'V30', 'V32', 'V34'.
Can someone please advise why this problem is happening, and how to avoid it? Thank you.
Reputation: 707
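When you predict with a fitted tidymodels workflow, new_data must contain the same predictor columns that were passed to fit(), not just the columns that survive the recipe. The workflow re-applies the whole recipe to new_data, including the step_select_roc step, so the feature selection is done for you at prediction time.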
This should fix your issue:
ion_test <- testing(ion_split) ## %>% select(V3, V7, V27, V31, V33) # don't select here!
predict_res <- predict(
final_mod,
new_data = ion_test,
type = "prob")
predict_res
# A tibble: 141 × 2
.pred_bad .pred_good
<dbl> <dbl>
1 0.0217 0.978
2 0.908 0.0917
3 0.961 0.0391
4 0.0341 0.966
5 0.0641 0.936
6 0.957 0.0428
7 0.0321 0.968
8 0.958 0.0424
9 0.291 0.709
10 0.0480 0.952
# … with 131 more rows
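To illustrate why the full set of columns is needed, here is a minimal sketch on a toy dataset. It is not your model: mtcars, step_rm() and linear_reg() are stand-ins I am assuming purely for illustration, with step_rm() playing the role of step_select_roc(). The point is that the workflow records the recipe's original input columns and checks new_data against them, even though only two predictors reach the model.

```r
library(tidymodels)

# Toy stand-in for the question's setup: the recipe takes every column of
# mtcars as input and then drops most of them with step_rm().
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_rm(cyl, disp, drat, qsec, vs, am, gear, carb)

wf_fit <- workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg()) %>%
  fit(data = mtcars)

# Only the predictors that survive the recipe reach the model ...
kept <- names(extract_mold(wf_fit)$predictors)
print(kept)

new_cars <- mtcars[1:5, ]

# ... but predict() works when new_data has the recipe's original columns.
preds_full <- predict(wf_fit, new_data = new_cars)

# Passing only the kept columns is expected to fail with a "required columns
# are missing" style error; capture the message instead of stopping the script.
subset_result <- tryCatch(
  predict(wf_fit, new_data = new_cars[, kept]),
  error = function(e) conditionMessage(e)
)
print(subset_result)
```

The exact error text may vary with your package versions, but the pattern should be the same as in your question: hand predict() the data in the same shape you handed to fit(), and let the recipe do the selection.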
Alternatively, you could refit using a recipe that contains only the five selected variables, and then predict on new data with those same variables selected. However, I feel that this goes a bit against the tidy philosophy of tidymodels, although it will give you a smaller object to save on disk.
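A rough sketch of that alternative, again on a toy dataset rather than your data (mtcars, step_rm() and linear_reg() are assumptions for illustration only): read the surviving predictor names from the first fit, then refit a second workflow whose recipe only ever sees those columns.

```r
library(tidymodels)

# First fit: the recipe sees all columns and drops most of them
# (step_rm() stands in for a feature-selection step).
full_rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_rm(cyl, disp, drat, qsec, vs, am, gear, carb)

full_fit <- workflow() %>%
  add_recipe(full_rec) %>%
  add_model(linear_reg()) %>%
  fit(data = mtcars)

# Names of the predictors that actually reached the model.
kept <- names(extract_mold(full_fit)$predictors)

# Second fit: the recipe's inputs are only the outcome and the kept
# predictors, so those are the only columns required at predict time.
small_train <- mtcars[, c("mpg", kept)]
small_rec <- recipe(mpg ~ ., data = small_train)

small_fit <- workflow() %>%
  add_recipe(small_rec) %>%
  add_model(linear_reg()) %>%
  fit(data = small_train)

# new_data may now contain just the kept columns.
small_preds <- predict(small_fit, new_data = mtcars[1:5, kept])
```

In your case the tuned hyperparameters would have to be carried over to the second workflow as well, which is part of why I would normally stick with the first approach.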
Also, note that I got a warning about deprecated use of pull_* functions in your original code. I replaced
pull_workflow_mold(final_mod)$predictor
with
extract_mold(final_mod)$predictor
# A tibble: 210 × 5
V3 V4 V5 V7 V27
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.724 -0.0108 0.797 0.8 0.780
2 0.599 0.147 0.699 0.851 0.614
3 0.495 0.0971 0.296 0.350 0.365
4 0 0 0 0 1
5 0.947 0.287 0.726 0.476 0.161
6 0.923 0.0780 0.927 0.897 0.188
7 0.675 0.0453 0.770 0.774 0.739
8 1 -0.0373 1 0.996 0.832
9 0.749 0.0255 0.990 0.759 0.823
10 0.882 -0.146 0.934 0.921 0.568
# … with 200 more rows
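Also note that I got different selected predictors, most likely because initial_split() is called before set.seed(35) in your code, so our training sets differ.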
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] kernlab_0.9-29 vctrs_0.3.8 rlang_0.4.11
[4] recipeselectors_0.0.1 mlbench_2.1-3 yardstick_0.0.8
[7] workflowsets_0.1.0 workflows_0.2.3 tune_0.1.6
[10] tidyr_1.1.3 tibble_3.1.4 rsample_0.1.0
[13] recipes_0.1.16 purrr_0.3.4 parsnip_0.1.7
[16] modeldata_0.1.1 infer_1.0.0 dplyr_1.0.7
[19] dials_0.0.10 scales_1.1.1 broom_0.7.9
[22] tidymodels_0.1.3 ggplot2_3.3.5