Norhther
Norhther

Reputation: 500

Predict in workflow throws that column doesn't exist

Given the following code

library(tidyverse)
library(lubridate)
library(tidymodels)
library(ranger)

df <- read_csv("https://raw.githubusercontent.com/norhther/datasets/main/bitcoin.csv")

df <- df %>%
  mutate(Date = dmy(Date),
         Change_Percent = str_replace(Change_Percent, "%", ""),
         Change_Percent = as.double(Change_Percent)
         ) %>%
  filter(year(Date) > 2017)

int <- interval(ymd("2020-01-20"), 
                ymd("2022-01-15"))

df <- df %>%
  mutate(covid = ifelse(Date %within% int, T, F))

df %>%
  ggplot(aes(x = Date, y = Price, color = covid)) + 
    geom_line()

df <- df %>%
  arrange(Date) %>%
  mutate(lag1 = lag(Price),
         lag2 = lag(lag1),
         lag3 = lag(lag2),
         profit_next_day = lead(Profit))

# modelatge
df_mod <- df %>%
  select(-covid, -Date, -Vol_K, -Profit) %>%
  mutate(profit_next_day = as.factor(profit_next_day))

set.seed(42)
data_split <- initial_split(df_mod) # 3/4
train_data <- training(data_split)
test_data  <- testing(data_split)

bitcoin_rec <- 
  recipe(profit_next_day ~ ., data = train_data) %>%
  step_naomit(all_outcomes(), all_predictors()) %>%
  step_normalize(all_numeric_predictors())

bitcoin_prep <-
  prep(bitcoin_rec)

bitcoin_train <- juice(bitcoin_prep)
bitcoin_test  <- bake(bitcoin_prep, test_data)

rf_spec <- 
  rand_forest(trees = 200) %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("classification")

bitcoin_wflow <- 
  workflow() %>% 
  add_model(rf_spec) %>% 
  add_recipe(bitcoin_prep)

bitcoin_fit <-
  bitcoin_wflow %>%
  fit(data = train_data)

final_model <- last_fit(bitcoin_wflow, data_split)

collect_metrics(final_model)

final_model %>%
  extract_workflow() %>%
  predict(test_data)

The last chunk of code that extracts the workflow and predicts the test_data is throwing the error:

Error in stop_subscript(): ! Can't subset columns that don't exist. x Column profit_next_day doesn't exist.

but profit_next_day exists already in test_data, as I checked multiple times, so I don't know what is happening. Never had this error before working with tidymodels.

Upvotes: 1

Views: 496

Answers (1)

Julia Silge
Julia Silge

Reputation: 11663

The problem here comes from using step_naomit() on the outcome. In general, steps that change rows (such as removing them) can be pretty tricky when it comes time to resample or predict on new data. You can read more in detail in our book, but I would suggest that you remove step_naomit() altogether from your recipe and change your earlier code to:

df_mod <- df %>%
  select(-covid, -Date, -Vol_K, -Profit) %>%
  mutate(profit_next_day = as.factor(profit_next_day)) %>%
  na.omit()

Upvotes: 2

Related Questions