Predict in workflow throws that column doesn't exist

Question

Given the following code:

library(tidyverse)
library(lubridate)
library(tidymodels)
library(ranger)

df <- read_csv("https://raw.githubusercontent.com/norhther/datasets/main/bitcoin.csv")

df <- df %>%
  mutate(Date = dmy(Date),
         Change_Percent = str_replace(Change_Percent, "%", ""),
         Change_Percent = as.double(Change_Percent)
         ) %>%
  filter(year(Date) > 2017)

int <- interval(ymd("2020-01-20"), 
                ymd("2022-01-15"))

df <- df %>%
  mutate(covid = ifelse(Date %within% int, T, F))

df %>%
  ggplot(aes(x = Date, y = Price, color = covid)) + 
    geom_line()

df <- df %>%
  arrange(Date) %>%
  mutate(lag1 = lag(Price),
         lag2 = lag(lag1),
         lag3 = lag(lag2),
         profit_next_day = lead(Profit))

# modeling
df_mod <- df %>%
  select(-covid, -Date, -Vol_K, -Profit) %>%
  mutate(profit_next_day = as.factor(profit_next_day))

set.seed(42)
data_split <- initial_split(df_mod) # 3/4
train_data <- training(data_split)
test_data  <- testing(data_split)

bitcoin_rec <- 
  recipe(profit_next_day ~ ., data = train_data) %>%
  step_naomit(all_outcomes(), all_predictors()) %>%
  step_normalize(all_numeric_predictors())

bitcoin_prep <-
  prep(bitcoin_rec)

bitcoin_train <- juice(bitcoin_prep)
bitcoin_test  <- bake(bitcoin_prep, test_data)

rf_spec <- 
  rand_forest(trees = 200) %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("classification")

bitcoin_wflow <- 
  workflow() %>% 
  add_model(rf_spec) %>% 
  add_recipe(bitcoin_prep)

bitcoin_fit <-
  bitcoin_wflow %>%
  fit(data = train_data)

final_model <- last_fit(bitcoin_wflow, data_split)

collect_metrics(final_model)

final_model %>%
  extract_workflow() %>%
  predict(test_data)

The last chunk of code that extracts the workflow and predicts the test_data is throwing the error:

Error in stop_subscript():
! Can't subset columns that don't exist.
x Column profit_next_day doesn't exist.

However, profit_next_day does exist in test_data (I checked multiple times), so I don't know what is happening. I have never seen this error before when working with tidymodels.

Julia Silge · Accepted Answer

The problem here comes from using step_naomit() on the outcome. In general, steps that change rows (for example, by removing them) can be pretty tricky when it comes time to resample or predict on new data. You can read about this in more detail in our book, but I would suggest removing step_naomit() from your recipe altogether and changing your earlier code to:

df_mod <- df %>%
  select(-covid, -Date, -Vol_K, -Profit) %>%
  mutate(profit_next_day = as.factor(profit_next_day)) %>%
  na.omit()
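
For illustration, here is a minimal sketch of how the rest of the pipeline might look once the rows with missing values are dropped up front. It is not the original poster's exact code: the toy data frame and its values are hypothetical stand-ins for the bitcoin data, the ranger package is assumed to be installed, and the recipe is added to the workflow unprepped, which is an assumption about how the workflow is meant to be set up.

```r
library(tidymodels)

set.seed(42)

# Hypothetical toy data standing in for the bitcoin data frame
n <- 200
toy <- tibble(
  Price = cumsum(rnorm(n)) + 100,
  profit_next_day = factor(sample(c("no", "yes"), n, replace = TRUE))
) %>%
  mutate(lag1 = dplyr::lag(Price),
         lag2 = dplyr::lag(lag1)) %>%
  na.omit()  # drop incomplete rows before splitting, not inside the recipe

toy_split <- initial_split(toy)

# No step_naomit() here: the recipe only transforms predictors
toy_rec <-
  recipe(profit_next_day ~ ., data = training(toy_split)) %>%
  step_normalize(all_numeric_predictors())

rf_spec <-
  rand_forest(trees = 200) %>%
  set_engine("ranger") %>%
  set_mode("classification")

toy_wflow <-
  workflow() %>%
  add_model(rf_spec) %>%
  add_recipe(toy_rec)  # the workflow preps the recipe itself when fitting

final_model <- last_fit(toy_wflow, toy_split)

# Predicting from the extracted workflow returns one row per test row
preds <- final_model %>%
  extract_workflow() %>%
  predict(testing(toy_split))
```

Because no rows are removed inside the recipe, the number of predictions matches the number of rows in the test set.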
