Betel

Reputation: 161

How to apply the trained & tested random forest model to a new dataset in tidymodels?

I have trained and tested a random forest model in R using tidymodels. Now I want to use the same model to predict on a completely new dataset (not the training dataset).

For example, Julia Silge explained the steps to train, test and evaluate a model in this blog post: Juliasilge's palmer penguins. I want to apply such a model to a completely new dataset with the same columns, except for the outcome column (here, sex).

Can anyone help me with the code for predicting on a new dataset?

Here is what I have tried, using a sample dataset:

library(tidymodels)   # provides %>%, filter(), select(), initial_split(), workflow(), ...
library(palmerpenguins)

penguins <- penguins %>%
  filter(!is.na(sex)) %>%
  select(-year, -island)

# Selecting the first 233 rows for training and testing

penguins_train_test<-penguins[1:233,]

# Setting aside the remaining rows and treating them as the new dataset that needs predictions (not testing). To mimic this, I removed the "sex" column, which the fitted model should predict.

penguins_newdata<-penguins[233:333,-6]


set.seed(123)
penguin_split <- initial_split(penguins_train_test, strata = sex)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)

Creating the model specification and the workflow:

rf_spec <- rand_forest() %>%
  set_mode("classification") %>%
  set_engine("ranger")

penguin_wf <- workflow() %>%
  add_formula(sex ~ .)

Applying to the test data

penguin_final <- penguin_wf %>%
  add_model(rf_spec) %>%
  last_fit(penguin_split)

collect_metrics(penguin_final)

Similarly applying to the new dataset "penguins_newdata"

penguins_newdata

penguin_wf %>%
  add_model(rf_spec) %>%
  fit(penguins_newdata)

The result I got is the following error:

Error: The following outcomes were not found in `data`: 'sex'.

I tried this way too

 fit(penguin_wf, penguins_newdata)

This is the error I got:

Error: The workflow must have a model. Provide one with `add_model()`.

Thank you in advance.

Upvotes: 2

Views: 1411

Answers (2)

Julia Silge

Reputation: 11623

The answer above by @missuse looks great, but I just want to add a bit of clarifying info about which workflow is unfitted and which workflow is fitted. If you have new data with no outcome yet, you want to predict on it with a fitted workflow.

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(palmerpenguins)

penguins <- penguins %>%
  filter(!is.na(sex)) %>%
  select(-year, -island)

penguins_newdata <- penguins[233:333,-6]


set.seed(123)
penguin_split <- initial_split(penguins, strata = sex)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)

rf_spec <- rand_forest() %>%
  set_mode("classification") %>%
  set_engine("ranger")

unfitted_wf <- workflow() %>%
  add_formula(sex ~ .) %>%
  add_model(rf_spec)


penguin_final <- last_fit(unfitted_wf, penguin_split)

collect_metrics(penguin_final)
#> # A tibble: 2 x 4
#>   .metric  .estimator .estimate .config             
#>   <chr>    <chr>          <dbl> <chr>               
#> 1 accuracy binary         0.940 Preprocessor1_Model1
#> 2 roc_auc  binary         0.983 Preprocessor1_Model1


# can predict on this fitted workflow
fitted_wf <- pluck(penguin_final$.workflow, 1)

predict(fitted_wf, new_data = penguins_newdata)
#> # A tibble: 101 x 1
#>    .pred_class
#>    <fct>      
#>  1 female     
#>  2 male       
#>  3 female     
#>  4 male       
#>  5 female     
#>  6 male       
#>  7 female     
#>  8 male       
#>  9 male       
#> 10 female     
#> # … with 91 more rows

Created on 2021-05-06 by the reprex package (v2.0.0)

I used variable names to hopefully make it extra clear which workflow is which. It's similar to a model, where you can specify a model but it can't be used for prediction until after you have fitted it to some training data.
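As a side note, more recent tidymodels releases provide extract_workflow() as an accessor for the fitted workflow stored inside a last_fit() result, which can read a bit more clearly than pluck(). The following is only a sketch, assuming a reasonably recent tidymodels installation; the object names (penguins_clean, class_preds, prob_preds) are illustrative and not from the code above, and the new data here starts at row 234 so that it does not overlap the training rows.

```r
library(tidymodels)
library(palmerpenguins)

penguins_clean <- penguins %>%
  filter(!is.na(sex)) %>%
  select(-year, -island)

# Pretend these rows are new, unlabeled data: drop the outcome column
penguins_newdata <- penguins_clean %>%
  slice(234:333) %>%
  select(-sex)

set.seed(123)
penguin_split <- initial_split(slice(penguins_clean, 1:233), strata = sex)

rf_spec <- rand_forest() %>%
  set_mode("classification") %>%
  set_engine("ranger")

unfitted_wf <- workflow() %>%
  add_formula(sex ~ .) %>%
  add_model(rf_spec)

# Fits on the training portion and evaluates on the testing portion
penguin_final <- last_fit(unfitted_wf, penguin_split)

# Pull the fitted workflow out of the last_fit() result
fitted_wf <- extract_workflow(penguin_final)

# Class predictions and class probabilities for the unlabeled rows
class_preds <- predict(fitted_wf, new_data = penguins_newdata)
prob_preds <- predict(fitted_wf, new_data = penguins_newdata, type = "prob")
```

With type = "prob" the result has one probability column per class, which is handy if you want to apply your own decision threshold rather than take the predicted class as is.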

Upvotes: 4

missuse

Reputation: 19716

The problem in your code is that you are trying to fit the final model on new data that lacks the target variable sex. This is what the error is telling you:

Error: The following outcomes were not found in `data`: 'sex'.

After all, your workflow contains the line add_formula(sex ~ .), so fit() expects a sex column in the data it is given.

Packages

library(tidyverse)
library(palmerpenguins)
library(tidymodels)

preprocess and split into train/test data and new data

penguins <- penguins %>%
  filter(!is.na(sex)) %>%
  select(-year, -island)

penguins_train_test <- penguins[1:233,]
penguins_newdata <- penguins[233:333,-6]

define the workflow

rf_spec <- rand_forest() %>%
  set_mode("classification") %>%
  set_engine("ranger")

penguin_wf <- workflow() %>%
  add_formula(sex ~ .) %>%
  add_model(rf_spec)

fit model on train data using workflow

model <- penguin_wf %>%
  fit(penguins_train_test)

use the model to predict on new data

predict(model, penguins_newdata)

output

# A tibble: 101 x 1
   .pred_class
   <fct>      
 1 female     
 2 male       
 3 male       
 4 male       
 5 female     
 6 male       
 7 female     
 8 male       
 9 male       
10 female     
# ... with 91 more rows
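If you want the predictions next to the rows they belong to, one option is to bind them back onto the new data. This is just a sketch of that idea; train_rows, new_rows and scored are names I made up for the example, and the new rows start at 234 so they do not overlap the training rows.

```r
library(tidymodels)
library(palmerpenguins)

penguins_clean <- penguins %>%
  filter(!is.na(sex)) %>%
  select(-year, -island)

train_rows <- slice(penguins_clean, 1:233)

# New data without the outcome column
new_rows <- penguins_clean %>%
  slice(234:333) %>%
  select(-sex)

rf_spec <- rand_forest() %>%
  set_mode("classification") %>%
  set_engine("ranger")

set.seed(123)
model <- workflow() %>%
  add_formula(sex ~ .) %>%
  add_model(rf_spec) %>%
  fit(train_rows)

# predict() returns one row per row of new data, in the same order,
# so the columns can be bound side by side
scored <- bind_cols(new_rows, predict(model, new_rows))
```

The result keeps all predictor columns and adds a .pred_class column, so it can be filtered or exported like any other tibble.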

Here no tuning is performed, and the model is built with default parameters. If you tune the hyperparameters via some form of resampling, which I gather from your question you are able to do, you can extract the best ones from the tuning results based on a specific metric

# the ROC AUC metric is named "roc_auc" in yardstick
param_final <- rf_tune_results %>%
  select_best(metric = "roc_auc")

and set them in the workflow

rf_workflow <- rf_workflow %>%
  finalize_workflow(param_final)

that way when you fit the model on the train data the optimal hyper parameters will be used.
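The snippets above assume that rf_workflow and rf_tune_results already exist. A minimal sketch of how they might be produced, assuming 5-fold cross-validation and a small hand-written grid over min_n only (the fold count, the grid values, trees = 200, and the names train_rows, folds and final_fit are all my own choices for illustration):

```r
library(tidymodels)
library(palmerpenguins)

penguins_clean <- penguins %>%
  filter(!is.na(sex)) %>%
  select(-year, -island)

train_rows <- slice(penguins_clean, 1:233)

set.seed(123)
folds <- vfold_cv(train_rows, v = 5, strata = sex)

# Mark min_n for tuning; everything else stays fixed
rf_tune_spec <- rand_forest(min_n = tune(), trees = 200) %>%
  set_mode("classification") %>%
  set_engine("ranger")

rf_workflow <- workflow() %>%
  add_formula(sex ~ .) %>%
  add_model(rf_tune_spec)

# Evaluate each candidate min_n on every fold
rf_tune_results <- tune_grid(
  rf_workflow,
  resamples = folds,
  grid = tibble(min_n = c(2L, 10L, 20L)),
  metrics = metric_set(roc_auc, accuracy)
)

# Pick the best candidate by ROC AUC and plug it into the workflow
param_final <- select_best(rf_tune_results, metric = "roc_auc")

final_fit <- rf_workflow %>%
  finalize_workflow(param_final) %>%
  fit(train_rows)
```

final_fit is a fitted workflow, so predict(final_fit, new_data = ...) works on new data exactly as in the untuned example.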

Additional details are given in the link I posted in the comment.

Upvotes: 4
