Below is a self-contained code example.
I want to keep all the columns around so I can debug issues; I don't want to just drop them.
I also want to separate the data processing (with recipes) from the model training, since I'll have lots of different preprocessing recipes but will be building the same sort of model over and over.
Why am I getting the error below?
Error:
→ A | error: ✖ The following variables have the wrong class:
• `name` must have class <factor>, not <character>.
• `id` must have class <factor>, not <character>.
• `gender` must have class <factor>, not <character>.
Debug info during the run:
> final_model <- train_lasso_model(recipe_obj, processed_data)
vfold [2 × 2] (S3: vfold_cv/rset/tbl_df/tbl/data.frame)
$ splits:List of 2
..$ :List of 4
.. ..$ data : tibble [3 × 6] (S3: tbl_df/tbl/data.frame)
.. .. ..$ name : Factor w/ 2 levels "sam","unknown": 1 1 1
.. .. ..$ id : Factor w/ 4 levels "1","2","3","unknown": 1 2 3
.. .. ..$ gender : Factor w/ 3 levels "female","male",..: 2 1 1
.. .. ..$ target : num [1:3] 4 5 6
.. .. ..$ gender_male : num [1:3] 1 0 0
.. .. ..$ gender_unknown: num [1:3] 0 0 0
.. ..$ in_id : int 3
.. ..$ out_id: logi NA
.. ..$ id : tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
.. .. ..$ id: chr "Fold1"
.. ..- attr(*, "class")= chr [1:2] "vfold_split" "rsplit"
..$ :List of 4
.. ..$ data : tibble [3 × 6] (S3: tbl_df/tbl/data.frame)
.. .. ..$ name : Factor w/ 2 levels "sam","unknown": 1 1 1
.. .. ..$ id : Factor w/ 4 levels "1","2","3","unknown": 1 2 3
.. .. ..$ gender : Factor w/ 3 levels "female","male",..: 2 1 1
.. .. ..$ target : num [1:3] 4 5 6
.. .. ..$ gender_male : num [1:3] 1 0 0
.. .. ..$ gender_unknown: num [1:3] 0 0 0
.. ..$ in_id : int [1:2] 1 2
.. ..$ out_id: logi NA
.. ..$ id : tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
.. .. ..$ id: chr "Fold2"
.. ..- attr(*, "class")= chr [1:2] "vfold_split" "rsplit"
$ id : chr [1:2] "Fold1" "Fold2"
- attr(*, "v")= num 2
- attr(*, "repeats")= num 1
- attr(*, "breaks")= num 4
- attr(*, "pool")= num 0.1
- attr(*, "fingerprint")= chr "2c80c86a0361fcf4a6d480eb1b0b8d79"
before tune_grid
→ A | error: ✖ The following variables have the wrong class:
• `name` must have class <factor>, not <character>.
• `id` must have class <factor>, not <character>.
• `gender` must have class <factor>, not <character>.
There were issues with some computations A: x2
after tune_grid
Error in `estimate_tune_results()`:
! All models failed. Run `show_notes(.Last.tune.result)` for more information.
Run `rlang::last_trace()` to see where the error occurred.
Warning message:
All models failed. Run `show_notes(.Last.tune.result)` for more information.
> rlang::last_trace()
<error/rlang_error>
Error in `estimate_tune_results()`:
! All models failed. Run `show_notes(.Last.tune.result)` for more information.
---
Backtrace:
▆
1. ├─global train_lasso_model(recipe_obj, processed_data)
2. │ └─tune_results %>% select_best(metric = "roc_auc")
3. ├─tune::select_best(., metric = "roc_auc")
4. └─tune:::select_best.tune_results(., metric = "roc_auc")
5. ├─tune::show_best(...)
6. └─tune:::show_best.tune_results(...)
7. └─tune::.filter_perf_metrics(x, metric, eval_time)
8. └─tune::estimate_tune_results(x)
> train <- prepped_recipe %>% juice
> sapply(train[, info_vars], class)
name id gender
"factor" "factor" "factor"
> sapply(processed_data[, info_vars], class)
name id gender
"factor" "factor" "factor"
> class(processed_data)
[1] "tbl_df" "tbl" "data.frame"
> packageVersion("tune")
[1] ‘1.2.1’
Code:
library(recipes)
library(workflows)
library(parsnip)  # logistic_reg(), set_engine(), fit()
library(rsample)  # vfold_cv()
library(tune)     # tune(), tune_grid(), select_best(), finalize_workflow()
train_lasso_model <- function(recipe_obj, processed_data,
grid_size = 10, folds=2) {
# Create a logistic regression model specification with Lasso regularization
log_reg_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
set_engine("glmnet")
# Create a workflow
workflow_obj <- workflow() %>%
add_recipe(recipe_obj) %>%
add_model(log_reg_spec)
# Set up cross-validation
cv_folds <- vfold_cv(processed_data, v = folds)
str(cv_folds)
# Tune the model to find the best regularization strength (penalty)
message("before tune_grid")
tune_results <- workflow_obj %>%
tune_grid(resamples = cv_folds, grid = grid_size)
message("after tune_grid")
# Check the best tuning parameters (lambda)
best_lambda <- tune_results %>%
select_best(metric = "roc_auc")
# Finalize the workflow with the best penalty
message("before finalize_workflow")
final_workflow <- workflow_obj %>%
finalize_workflow(best_lambda)
message("after finalize_workflow")
# Fit the final model
final_model <- fit(final_workflow, data = processed_data)
# Return the trained model
return(final_model)
}
test_data <- data.frame(
name = c("sam", "sam", "sam"),
id = c("1", "2", "3"),
gender = c("male", "female", "female"),
target = c(4, 5, 6)
)
info_vars <- c("name", "id",
# mark gender as informational, but still make it a dummy var
"gender")
recipe_obj <- recipe(target ~ ., data = test_data) %>%
# mark vars as not used in the model
update_role(
all_of(info_vars),
new_role = "informational") %>%
# Create an "unknown" category for all unknown factor levels
step_unknown(all_nominal(), skip = TRUE) %>%
# Convert factors/character columns to dummies
step_dummy(all_nominal(), -all_outcomes(), -all_of(info_vars),
gender,
keep_original_cols = TRUE)
prepped_recipe <- recipe_obj %>% prep(training = test_data)
processed_data <- prepped_recipe %>% bake(new_data=NULL)
final_model <- train_lasso_model(recipe_obj, processed_data)
train <- prepped_recipe %>% juice
sapply(train[, info_vars], class)
sapply(processed_data[, info_vars], class)
class(processed_data)
packageVersion("tune")
NOTE: I get the same error if I comment out step_unknown and step_dummy.
Thanks for the additional details and the reproducible example. There are two ways to go about this. First, you could make a copy of gender and give it a non-predictor role. Alternatively, you can keep the original column (as you do) and remove it from the model via the model formula when the workflow is defined. I took the latter approach.
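For completeness, a rough sketch of the first approach could look like the code below. Treat it as an untested sketch: it assumes the tidymodels packages and the test_data defined in the reprex that follows, and gender_raw is just a hypothetical name for the copied column.
recipe(target ~ ., data = test_data) %>%
  update_role(name, id, new_role = "informational") %>%
  # Copy gender for debugging; the `role =` argument keeps the copy out of the predictor set
  step_mutate(gender_raw = gender, role = "informational") %>%
  # These now only touch the remaining nominal predictor (gender)
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors())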
I've marked a few small comments with NOTE: in the code. Here's the code that I would use for the second approach:
library(tidymodels)
tidymodels_prefer()
test_data <- expand.grid(
name = c("sam", "sam", "sam"),
id = c("1", "2", "3"),
gender = c("male", "female", "female"),
# NOTE: Don't use `c(4, 5, 6)` as the outcome for logistic reg
target = c("yes", "no")
) %>%
# NOTE: Make the categorical data into factors
mutate(across(where(is.character), ~ as.factor(.x)))
recipe_obj <-
recipe(target ~ ., data = test_data) %>%
# mark vars as not used in the model
update_role(name, id, new_role = "informational") %>%
# Create an "unknown" category for all unknown factor levels
# NOTE: capture all character/factor _predictors_ for transformations
step_unknown(all_nominal_predictors()) %>%
# Convert factors/character columns to dummies
step_dummy(all_nominal_predictors(), keep_original_cols = TRUE)
log_reg_spec <-
logistic_reg(penalty = tune(), mixture = 1) %>%
set_engine("glmnet")
# Create a workflow
workflow_obj <- workflow() %>%
add_recipe(recipe_obj) %>%
# NOTE: specify what the model gets post-recipe.
# See https://www.tmwr.org/workflows#special-model-formulas
# This removes gender from being in the model (but uses its indicators)
# but keeps the gender column in the data.
add_model(log_reg_spec, formula = target ~ . - gender)
grid_size <- 10
num_folds <- 2
cv_folds <- vfold_cv(test_data, v = num_folds)
tune_results <- workflow_obj %>%
tune_grid(resamples = cv_folds, grid = grid_size)
tune_results
#> # Tuning results
#> # 2-fold cross-validation
#> # A tibble: 2 × 4
#> splits id .metrics .notes
#> <list> <chr> <list> <list>
#> 1 <split [27/27]> Fold1 <tibble [30 × 5]> <tibble [0 × 3]>
#> 2 <split [27/27]> Fold2 <tibble [30 × 5]> <tibble [0 × 3]>
Created on 2025-02-04 with reprex v2.1.0
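As an optional follow-up (not part of the reprex above, so treat it as a sketch), you could finalize the workflow with the best penalty, fit it, and look at the coefficient names from the glmnet fit to confirm that only the gender_* indicator columns, not the raw gender factor, enter the model:
best_penalty <- select_best(tune_results, metric = "roc_auc")
final_fit <- workflow_obj %>%
  finalize_workflow(best_penalty) %>%
  fit(data = test_data)
# Expect "(Intercept)" plus the gender_* dummy columns, and no raw `gender` term
rownames(coef(extract_fit_engine(final_fit)))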