How to apply the trained & tested random forest model to a new dataset in tidymodels?

Question

I have trained and tested a random forest model in R using tidymodels. Now I want to use the same model to predict on a completely new dataset (not the training dataset).

For example, Julia Silge explained the steps to train, test, and evaluate a model in this blog post: Juliasilge's palmer penguins. I want to apply that model to a completely new dataset with the same columns (except the outcome column, here sex).

Can anyone help me with the code for predicting on a new dataset?

Here is what I have tried, using a sample dataset:

library(tidymodels)
library(palmerpenguins)

penguins <- penguins %>%
  filter(!is.na(sex)) %>%
  select(-year, -island)

# Select the first 233 rows for training and testing

penguins_train_test <- penguins[1:233, ]

# Split off the remaining rows and treat them as the new dataset that needs predictions (not testing). For that reason the "sex" column, which the fitted model should predict, is removed. The range starts at row 234 so that no row is shared with penguins_train_test.

penguins_newdata <- penguins[234:333, -6]


set.seed(123)
penguin_split <- initial_split(penguins_train_test, strata = sex)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)

Creating the model specification:

rf_spec <- rand_forest() %>%
  set_mode("classification") %>%
  set_engine("ranger")

penguin_wf <- workflow() %>%
  add_formula(sex ~ .)

Applying to the test data

penguin_final <- penguin_wf %>%
  add_model(rf_spec) %>%
  last_fit(penguin_split)

collect_metrics(penguin_final)

Similarly applying to the new dataset "penguins_newdata"

penguins_newdata

penguin_wf %>%
  add_model(rf_spec) %>%
  fit(penguins_newdata)

The result I got is the following error:

Error: The following outcomes were not found in `data`: 'sex'.

I tried this way too:

fit(penguin_wf, penguins_newdata)

This is the error I got:

Error: The workflow must have a model. Provide one with `add_model()`.

Thank you in advance.

Julia Silge · Accepted Answer

The answer above by @missuse looks great, but I just want to add a bit of clarifying info about which workflow is unfitted and which workflow is fitted. If you have new data with no outcome yet, you want to predict on it with a fitted workflow.

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(palmerpenguins)

penguins <- penguins %>%
  filter(!is.na(sex)) %>%
  select(-year, -island)

penguins_newdata <- penguins[233:333,-6]


set.seed(123)
penguin_split <- initial_split(penguins, strata = sex)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)

rf_spec <- rand_forest() %>%
  set_mode("classification") %>%
  set_engine("ranger")

unfitted_wf <- workflow() %>%
  add_formula(sex ~ .) %>%
  add_model(rf_spec)


penguin_final <- last_fit(unfitted_wf, penguin_split)

collect_metrics(penguin_final)
#> # A tibble: 2 x 4
#>   .metric  .estimator .estimate .config             
#>   <chr>    <chr>          <dbl> <chr>
#> 1 accuracy binary         0.940 Preprocessor1_Model1
#> 2 roc_auc  binary         0.983 Preprocessor1_Model1


# can predict on this fitted workflow
fitted_wf <- pluck(penguin_final$.workflow, 1)

predict(fitted_wf, new_data = penguins_newdata)
#> # A tibble: 101 x 1
#>    .pred_class
#>    <fct>
#>  1 female     
#>  2 male       
#>  3 female     
#>  4 male       
#>  5 female     
#>  6 male       
#>  7 female     
#>  8 male       
#>  9 male       
#> 10 female     
#> # … with 91 more rows

Created on 2021-05-06 by the reprex package (v2.0.0)

I used variable names to hopefully make it extra clear which workflow is which. It's similar to a model, where you can specify a model but it can't be used for prediction until after you have fitted it to some training data.
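
The distinction can also be checked directly. The following is a minimal sketch, not part of the original reprex, assuming the tidymodels, ranger, and palmerpenguins packages are installed. The names penguins_clean, train_rows, new_rows, and the result objects are made up for illustration, and the held-out rows only stand in for genuinely new data. It shows that predict() fails on the unfitted workflow, that fit() on training data returns a fitted workflow, and that the fitted workflow can return classes, class probabilities, or the new rows with predictions attached.

```r
library(tidymodels)
library(palmerpenguins)

penguins_clean <- penguins %>%
  filter(!is.na(sex)) %>%
  select(-year, -island)

# Illustrative stand-in for "new data": held-out rows with the outcome removed
set.seed(123)
split <- initial_split(penguins_clean, strata = sex)
train_rows <- training(split)
new_rows <- testing(split) %>% select(-sex)

# A workflow with a formula and a model spec, but not yet fitted
unfitted_wf <- workflow() %>%
  add_formula(sex ~ .) %>%
  add_model(
    rand_forest() %>%
      set_mode("classification") %>%
      set_engine("ranger")
  )

# Predicting with the unfitted workflow raises an error;
# capture the condition object instead of stopping the script
unfitted_result <- tryCatch(
  predict(unfitted_wf, new_data = new_rows),
  error = function(e) e
)

# Fit on data that DOES contain the outcome column
fitted_wf <- fit(unfitted_wf, data = train_rows)

# Predict on data that does NOT contain the outcome column
class_preds <- predict(fitted_wf, new_data = new_rows)
prob_preds <- predict(fitted_wf, new_data = new_rows, type = "prob")

# augment() returns the new rows together with the prediction columns
with_preds <- augment(fitted_wf, new_data = new_rows)
```

Here fit() is called on rows that include sex, which is why it succeeds where fit(penguins_newdata) in the question did not. In recent tidymodels releases, extract_workflow(penguin_final) should be an equivalent way to pull the fitted workflow out of a last_fit() result, in place of the pluck() call above.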
