Below is a self-contained code example.
I want to keep all the columns around so I can debug issues; I don't want to just drop them.
I also want to separate the data processing (with recipes) from the model training, since I'll have lots of different preprocessing recipes but will be building the same sort of model over and over.
Why am I getting the error below?
Error:
→ A | error: ✖ The following variables have the wrong class:
• `name` must have class <factor>, not <character>.
• `id` must have class <factor>, not <character>.
• `gender` must have class <factor>, not <character>.
Debug info during the run:
> final_model <- train_lasso_model(recipe_obj, processed_data)
vfold [2 × 2] (S3: vfold_cv/rset/tbl_df/tbl/data.frame)
$ splits:List of 2
..$ :List of 4
.. ..$ data : tibble [3 × 6] (S3: tbl_df/tbl/data.frame)
.. .. ..$ name : Factor w/ 2 levels "sam","unknown": 1 1 1
.. .. ..$ id : Factor w/ 4 levels "1","2","3","unknown": 1 2 3
.. .. ..$ gender : Factor w/ 3 levels "female","male",..: 2 1 1
.. .. ..$ target : num [1:3] 4 5 6
.. .. ..$ gender_male : num [1:3] 1 0 0
.. .. ..$ gender_unknown: num [1:3] 0 0 0
.. ..$ in_id : int 3
.. ..$ out_id: logi NA
.. ..$ id : tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
.. .. ..$ id: chr "Fold1"
.. ..- attr(*, "class")= chr [1:2] "vfold_split" "rsplit"
..$ :List of 4
.. ..$ data : tibble [3 × 6] (S3: tbl_df/tbl/data.frame)
.. .. ..$ name : Factor w/ 2 levels "sam","unknown": 1 1 1
.. .. ..$ id : Factor w/ 4 levels "1","2","3","unknown": 1 2 3
.. .. ..$ gender : Factor w/ 3 levels "female","male",..: 2 1 1
.. .. ..$ target : num [1:3] 4 5 6
.. .. ..$ gender_male : num [1:3] 1 0 0
.. .. ..$ gender_unknown: num [1:3] 0 0 0
.. ..$ in_id : int [1:2] 1 2
.. ..$ out_id: logi NA
.. ..$ id : tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
.. .. ..$ id: chr "Fold2"
.. ..- attr(*, "class")= chr [1:2] "vfold_split" "rsplit"
$ id : chr [1:2] "Fold1" "Fold2"
- attr(*, "v")= num 2
- attr(*, "repeats")= num 1
- attr(*, "breaks")= num 4
- attr(*, "pool")= num 0.1
- attr(*, "fingerprint")= chr "2c80c86a0361fcf4a6d480eb1b0b8d79"
before tune_grid
→ A | error: ✖ The following variables have the wrong class:
• `name` must have class <factor>, not <character>.
• `id` must have class <factor>, not <character>.
• `gender` must have class <factor>, not <character>.
There were issues with some computations A: x2
after tune_grid
Error in `estimate_tune_results()`:
! All models failed. Run `show_notes(.Last.tune.result)` for more information.
Run `rlang::last_trace()` to see where the error occurred.
Warning message:
All models failed. Run `show_notes(.Last.tune.result)` for more information.
> rlang::last_trace()
<error/rlang_error>
Error in `estimate_tune_results()`:
! All models failed. Run `show_notes(.Last.tune.result)` for more information.
---
Backtrace:
▆
1. ├─global train_lasso_model(recipe_obj, processed_data)
2. │ └─tune_results %>% select_best(metric = "roc_auc")
3. ├─tune::select_best(., metric = "roc_auc")
4. └─tune:::select_best.tune_results(., metric = "roc_auc")
5. ├─tune::show_best(...)
6. └─tune:::show_best.tune_results(...)
7. └─tune::.filter_perf_metrics(x, metric, eval_time)
8. └─tune::estimate_tune_results(x)
> train <- prepped_recipe %>% juice
> sapply(train[, info_vars], class)
name id gender
"factor" "factor" "factor"
> sapply(processed_data[, info_vars], class)
name id gender
"factor" "factor" "factor"
> class(processed_data)
[1] "tbl_df" "tbl" "data.frame"
> packageVersion("tune")
[1] ‘1.2.1’
Code:
library(recipes)
library(workflows)
library(parsnip)  # logistic_reg(), set_engine(), fit()
library(rsample)  # vfold_cv()
library(tune)     # tune(), tune_grid(), select_best(), finalize_workflow()
train_lasso_model <- function(recipe_obj, processed_data,
grid_size = 10, folds=2) {
# Create a logistic regression model specification with Lasso regularization
log_reg_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
set_engine("glmnet")
# Create a workflow
workflow_obj <- workflow() %>%
add_recipe(recipe_obj) %>%
add_model(log_reg_spec)
# Set up cross-validation
cv_folds <- vfold_cv(processed_data, v = folds)
str(cv_folds)
# Tune the model to find the best regularization strength (penalty)
message("before tune_grid")
tune_results <- workflow_obj %>%
tune_grid(resamples = cv_folds, grid = grid_size)
message("after tune_grid")
# Check the best tuning parameters (lambda)
best_lambda <- tune_results %>%
select_best(metric = "roc_auc")
# Finalize the workflow with the best penalty
message("before finalize_workflow")
final_workflow <- workflow_obj %>%
finalize_workflow(best_lambda)
message("after finalize_workflow")
# Fit the final model
final_model <- fit(final_workflow, data = processed_data)
# Return the trained model
return(final_model)
}
test_data <- data.frame(
name = c("sam", "sam", "sam"),
id = c("1", "2", "3"),
gender = c("male", "female", "female"),
target = c(4, 5, 6)
)
info_vars <- c("name", "id",
# mark gender as informational, but still make it a dummy var
"gender")
recipe_obj <- recipe(target ~ ., data = test_data) %>%
# mark vars as not used in the model
update_role(
all_of(info_vars),
new_role = "informational") %>%
# Create an "unknown" category for all unknown factor levels
step_unknown(all_nominal(), skip = TRUE) %>%
# Convert factors/character columns to dummies
step_dummy(all_nominal(), -all_outcomes(), -all_of(info_vars),
gender,
keep_original_cols = TRUE)
prepped_recipe <- recipe_obj %>% prep(training = test_data)
processed_data <- prepped_recipe %>% bake(new_data=NULL)
final_model <- train_lasso_model(recipe_obj, processed_data)
train <- prepped_recipe %>% juice
sapply(train[, info_vars], class)
sapply(processed_data[, info_vars], class)
class(processed_data)
packageVersion("tune")
NOTE: I get the same error if I comment out step_unknown and step_dummy.
Thanks for the additional details and the reproducible example. There are two ways to go about this. First, you could make a copy of gender and give it a non-predictor role. Alternatively, you can keep the original column (as you do) and remove it from the model via the model formula when the workflow is defined. I took the latter approach.
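For completeness, a rough sketch of the first approach could look like the code below. Treat it as an untested sketch: it assumes the tidymodels packages and the test_data defined in the reprex that follows, and gender_raw is just a hypothetical name for the copied column.
recipe(target ~ ., data = test_data) %>%
  update_role(name, id, new_role = "informational") %>%
  # Copy gender for debugging; the `role =` argument keeps the copy out of the predictor set
  step_mutate(gender_raw = gender, role = "informational") %>%
  # These now only touch the remaining nominal predictor (gender)
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors())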
I've marked a few small comments with NOTE: in the code. Here's the code that I would use for the second approach:
library(tidymodels)
tidymodels_prefer()
test_data <- expand.grid(
name = c("sam", "sam", "sam"),
id = c("1", "2", "3"),
gender = c("male", "female", "female"),
# NOTE: Don't use `c(4, 5, 6)` as the outcome for logistic reg
target = c("yes", "no")
) %>%
# NOTE: Make the categorical data into factors
mutate(across(where(is.character), ~ as.factor(.x)))
recipe_obj <-
recipe(target ~ ., data = test_data) %>%
# mark vars as not used in the model
update_role(name, id, new_role = "informational") %>%
# Create an "unknown" category for all unknown factor levels
# NOTE: capture all character/factor _predictors_ for transformations
step_unknown(all_nominal_predictors()) %>%
# Convert factors/character columns to dummies
step_dummy(all_nominal_predictors(), keep_original_cols = TRUE)
log_reg_spec <-
logistic_reg(penalty = tune(), mixture = 1) %>%
set_engine("glmnet")
# Create a workflow
workflow_obj <- workflow() %>%
add_recipe(recipe_obj) %>%
# NOTE: specify what the model gets post-recipe.
# See https://www.tmwr.org/workflows#special-model-formulas
# This removes gender from being in the model (but uses its indicators)
# but keeps the gender column in the data.
add_model(log_reg_spec, formula = target ~ . - gender)
grid_size <- 10
num_folds <- 2
cv_folds <- vfold_cv(test_data, v = num_folds)
tune_results <- workflow_obj %>%
tune_grid(resamples = cv_folds, grid = grid_size)
tune_results
#> # Tuning results
#> # 2-fold cross-validation
#> # A tibble: 2 × 4
#> splits id .metrics .notes
#> <list> <chr> <list> <list>
#> 1 <split [27/27]> Fold1 <tibble [30 × 5]> <tibble [0 × 3]>
#> 2 <split [27/27]> Fold2 <tibble [30 × 5]> <tibble [0 × 3]>
Created on 2025-02-04 with reprex v2.1.0
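As an optional follow-up (not part of the reprex above, so treat it as a sketch), you could finalize the workflow with the best penalty, fit it, and look at the coefficient names from the glmnet fit to confirm that only the gender_* indicator columns, not the raw gender factor, enter the model:
best_penalty <- select_best(tune_results, metric = "roc_auc")
final_fit <- workflow_obj %>%
  finalize_workflow(best_penalty) %>%
  fit(data = test_data)
# Expect "(Intercept)" plus the gender_* dummy columns, and no raw `gender` term
rownames(coef(extract_fit_engine(final_fit)))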