LGe

Reputation: 516

How to prep a recipe that includes tunable arguments?

As you can see from my code, I am trying to include feature selection in my tidymodels workflow. I am using a Kaggle dataset to predict customer churn.

To apply the preprocessing to both the training and test data, I bake the recipe after calling prep().

However, I want to tune the top_p argument of step_select_roc(), and I do not know how to prep() the recipe in that case. Doing it as in my reprex results in an error.

Maybe I have to adapt my workflow and separate some recipe tasks to get the job done. What is the best approach to achieve this?

#### LIBS

suppressPackageStartupMessages(library(tidymodels))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(themis))
suppressPackageStartupMessages(library(recipeselectors))


#### INPUT

# get dataset from: https://www.kaggle.com/shrutimechlearn/churn-modelling
data <- fread("Churn_Modelling.csv")


# split data
set.seed(seed = 1972) 
train_test_split <-
  rsample::initial_split(
    data = data,     
    prop = 0.80   
  ) 
train_tbl <- train_test_split %>% training() 
test_tbl  <- train_test_split %>% testing() 


#### FEATURE ENGINEERING

# Define the recipe
recipe <- recipe(Exited ~ ., data = train_tbl) %>%
  step_rm(one_of("RowNumber", "Surname")) %>%
  update_role(CustomerId, new_role = "Helper") %>%
  step_num2factor(all_outcomes(),
                  levels = c("No", "Yes"),
                  transform = function(x) {x + 1}) %>%
  step_normalize(all_numeric(), -has_role(match = "Helper")) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_corr(all_numeric(), -has_role("Helper")) %>%
  step_nzv(all_predictors()) %>%
  step_select_roc(all_predictors(), outcome = "Exited", top_p = tune()) %>%  
  prep()


# Bake it
train_baked <- recipe %>%  bake(train_tbl)
test_baked <- recipe %>% bake(test_tbl) 

Upvotes: 0

Views: 581

Answers (2)

LGe

Reputation: 516

Thanks to the help of Steven Pawley, I was able to integrate the tunable top_p argument of step_select_roc() into my tidymodels workflow. As Julia Silge mentioned, it is not possible to prep a recipe that still has tunable arguments. So if you want to prep and bake your recipe, you can only do so after you have finalized your model and recipe, as in the following example:

suppressPackageStartupMessages(library(tidymodels))
suppressPackageStartupMessages(library(doParallel))
suppressPackageStartupMessages(library(recipeselectors))
suppressPackageStartupMessages(library(finetune))

data(cells, package = "modeldata")

cells <- cells %>% select(-case)
set.seed(31)
split <- initial_split(cells, prop = 0.8)
train <- training(split)
test <- testing(split)

rec <-
    recipe(class ~ ., data = train) %>%
    step_corr(all_predictors(), threshold = 0.9) %>% 
    step_select_roc(all_predictors(), outcome = "class", top_p = tune())

# xgboost model
xgb_spec <- boost_tree(
    trees = tune(), 
    tree_depth = tune(), min_n = tune(), 
    loss_reduction = tune(),                    
    sample_size = tune(), mtry = tune(),         
    learn_rate = tune(),                        
    stop_iter = tune()
) %>% 
    set_engine("xgboost") %>% 
    set_mode("classification")

# grid
xgb_grid <- grid_latin_hypercube(
    trees(),
    tree_depth(),
    min_n(),
    loss_reduction(),
    sample_size = sample_prop(),
    finalize(mtry(), train),
    learn_rate(),
    stop_iter(range = c(5L,50L)),
    size = 5
)

rec_grid <- grid_latin_hypercube(
    parameters(rec) %>% 
        update(top_p = top_p(c(0,30))) ,
    size = 5
)

comp_grid <- merge(xgb_grid, rec_grid)

model_metrics <- metric_set(roc_auc)  


# resample only the training set so the test set stays untouched during tuning
rs <- vfold_cv(train)

ctrl <- control_grid(pkgs = "recipeselectors")

cores <- parallel::detectCores(logical = FALSE)
cl <- makePSOCKcluster(cores)
registerDoParallel(cl)
set.seed(234)
rfe_res <-
    xgb_spec %>% 
    tune_grid(
        preprocessor = rec,
        resamples = rs,
        grid = comp_grid,
        control = ctrl
    )
stopCluster(cl)


best <- rfe_res %>% select_best("roc_auc")

# finalize
final_mod <- finalize_model(xgb_spec, best)
final_rec <- finalize_recipe(rec, best)

# bakery: prep the finalized recipe once, then bake both sets with it
final_prep <- prep(final_rec)
bake_test <- bake(final_prep, new_data = test)
bake_train <- bake(final_prep, new_data = train)
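
If the goal is only to apply the tuned preprocessing consistently to the training and test data, one possible alternative is to bundle the recipe and model in a workflow, so that prep() and bake() are handled internally and the finalized recipe can still be pulled out afterwards. The following is a minimal sketch under some assumptions, not the method used above: step_pca() with a tunable num_comp stands in for step_select_roc() with top_p (so that the example only needs packages loaded by tidymodels), a plain logistic regression stands in for the xgboost model, and the two-value grid is purely illustrative.

```r
suppressPackageStartupMessages(library(tidymodels))

data(cells, package = "modeldata")
cells <- cells %>% select(-case)

set.seed(31)
split <- initial_split(cells, prop = 0.8)
train <- training(split)
test  <- testing(split)

# Stand-in recipe (assumption): num_comp plays the role of top_p
rec <- recipe(class ~ ., data = train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), num_comp = tune())

# Stand-in model (assumption): no tunable arguments, to keep the sketch short
spec <- logistic_reg() %>% set_engine("glm")

# Bundle recipe and model; the recipe is NOT prepped here
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(spec)

set.seed(234)
rs <- vfold_cv(train, v = 3)

# Illustrative grid: the column name must match the tune() id
res <- tune_grid(
  wf,
  resamples = rs,
  grid = tibble(num_comp = c(2L, 5L)),
  metrics = metric_set(roc_auc)
)

best <- select_best(res, metric = "roc_auc")

# Replace tune() with the selected value and fit on the training set
final_fit <- finalize_workflow(wf, best) %>% fit(data = train)

# The fitted workflow holds the prepped recipe, which can be baked as usual
prepped_rec <- extract_recipe(final_fit)
baked_test  <- bake(prepped_rec, new_data = test)
```

With this layout the recipe is only ever prepped after its tunable argument has a concrete value, which is the same constraint as in the manual finalize_recipe() approach.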

Upvotes: 1

Julia Silge

Reputation: 11613

You can't prep() a recipe that has tunable arguments. Think of prep() for a recipe as analogous to fit() for a model: you can't fit a model if you haven't set its hyperparameters.

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step

rec <- recipe( ~ ., data = USArrests) %>%
  step_normalize(all_numeric()) %>%
  step_pca(all_numeric(), num_comp = tune::tune())

prep(rec, training = USArrests)
#> Error in `prep()`:
#> ! You cannot `prep()` a tuneable recipe. Argument(s) with `tune()`: 'num_comp'. Do you want to use a tuning function such as `tune_grid()`?

Created on 2022-02-22 by the reprex package (v2.0.1)
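
To illustrate the other side of the analogy, here is a small sketch showing that the same recipe can be prepped once the tune() placeholder has been replaced by a concrete value. The value num_comp = 2 is an arbitrary choice for illustration (in practice it would come from something like select_best() after tuning), and finalize_recipe() is taken from the tune package.

```r
library(recipes)
library(tune)

rec <- recipe(~ ., data = USArrests) %>%
  step_normalize(all_numeric()) %>%
  step_pca(all_numeric(), num_comp = tune())

# Arbitrary value for illustration; the name must match the tune() id
chosen <- tibble::tibble(num_comp = 2L)

# Swap the tune() placeholder for the chosen value
final_rec <- finalize_recipe(rec, chosen)

# prep() now works because nothing is left to tune
prepped <- prep(final_rec, training = USArrests)

# new_data = NULL returns the processed training data
baked <- bake(prepped, new_data = NULL)
```

Here baked contains the principal component columns in place of the original variables.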

Upvotes: 2
