Reputation: 161
I have trained and tested a random forest model in R using tidymodels. Now I want to use the same model to predict on a completely new dataset (not the training dataset).
For example, Julia Silge explained the steps to train, test, and evaluate a model in this blog post: Julia Silge's Palmer penguins. I want to apply such a model to a completely new dataset with the same columns (except the column to be predicted, here sex).
Can anyone help me with the code for predicting on a new dataset?
Here is what I have tried, using a sample dataset:
library(tidymodels)
library(palmerpenguins)
penguins <- penguins %>%
filter(!is.na(sex)) %>%
select(-year, -island)
# Select the first 233 rows for training and testing
penguins_train_test<-penguins[1:233,]
# Split off some other rows from the parent data and treat them as the new dataset
# that needs a prediction (not testing). For this reason the "sex" column, which
# the fitted model should predict, is removed.
penguins_newdata<-penguins[233:333,-6]
set.seed(123)
penguin_split <- initial_split(penguins_train_test, strata = sex)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)
Creating the model specification:
rf_spec <- rand_forest() %>%
set_mode("classification") %>%
set_engine("ranger")
penguin_wf <- workflow() %>%
add_formula(sex ~ .)
Applying to the test data:
penguin_final <- penguin_wf %>%
add_model(rf_spec) %>%
last_fit(penguin_split)
collect_metrics(penguin_final)
Similarly, applying to the new dataset "penguins_newdata":
penguins_newdata
penguin_wf %>%
add_model(rf_spec) %>%
fit(penguins_newdata)
The result I got is the following error:
Error: The following outcomes were not found in `data`: 'sex'.
I also tried it this way:
fit(penguin_wf, penguins_newdata)
This is the error I got:
Error: The workflow must have a model. Provide one with `add_model()`.
Thank you in advance.
Upvotes: 2
Views: 1411
Reputation: 11623
The other answer by @missuse looks great, but I just want to add a bit of clarifying info about which workflow is unfitted and which workflow is fitted. If you have new data with no outcome yet, you want to predict on it with a fitted workflow.
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
library(palmerpenguins)
penguins <- penguins %>%
filter(!is.na(sex)) %>%
select(-year, -island)
penguins_newdata <- penguins[233:333,-6]
set.seed(123)
penguin_split <- initial_split(penguins, strata = sex)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)
rf_spec <- rand_forest() %>%
set_mode("classification") %>%
set_engine("ranger")
unfitted_wf <- workflow() %>%
add_formula(sex ~ .) %>%
add_model(rf_spec)
penguin_final <- last_fit(unfitted_wf, penguin_split)
collect_metrics(penguin_final)
#> # A tibble: 2 x 4
#> .metric .estimator .estimate .config
#> <chr> <chr> <dbl> <chr>
#> 1 accuracy binary 0.940 Preprocessor1_Model1
#> 2 roc_auc binary 0.983 Preprocessor1_Model1
# can predict on this fitted workflow
fitted_wf <- pluck(penguin_final$.workflow, 1)
predict(fitted_wf, new_data = penguins_newdata)
#> # A tibble: 101 x 1
#> .pred_class
#> <fct>
#> 1 female
#> 2 male
#> 3 female
#> 4 male
#> 5 female
#> 6 male
#> 7 female
#> 8 male
#> 9 male
#> 10 female
#> # … with 91 more rows
Created on 2021-05-06 by the reprex package (v2.0.0)
I used variable names to hopefully make it extra clear which workflow is which. It's similar to a model, where you can specify a model but it can't be used for prediction until after you have fitted it to some training data.
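As a small follow-on sketch (not part of the reprex above), the same fitted-versus-unfitted distinction applies if you fit the workflow directly with `fit()` instead of going through `last_fit()`. The object names `penguins_clean`, `probs`, and `scored` are my own, and the row ranges are simply the ones from the question. Once the workflow is fitted, `predict()` with `type = "prob"` should return class probabilities, and `augment()` should attach the predictions to the new data:

```r
library(tidymodels)
library(palmerpenguins)

penguins_clean <- penguins %>%
  filter(!is.na(sex)) %>%
  select(-year, -island)

# Same row ranges as in the question; the new data has no outcome column
penguins_train_test <- penguins_clean[1:233, ]
penguins_newdata <- penguins_clean[233:333, ] %>% select(-sex)

rf_spec <- rand_forest() %>%
  set_mode("classification") %>%
  set_engine("ranger")

# Unfitted: only a description of preprocessing + model
unfitted_wf <- workflow() %>%
  add_formula(sex ~ .) %>%
  add_model(rf_spec)

# Fitted: trained on data that contains the outcome
set.seed(123)
fitted_wf <- fit(unfitted_wf, data = penguins_train_test)

# Class probabilities for the new rows
probs <- predict(fitted_wf, new_data = penguins_newdata, type = "prob")

# New data with the prediction columns attached
scored <- augment(fitted_wf, new_data = penguins_newdata)
```

Calling `predict()` on `unfitted_wf` instead would fail, because nothing has been trained yet.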
Upvotes: 4
Reputation: 19716
The problem in your code is that you are trying to fit the final model on new data that lacks the target variable sex.
This is what the error is telling you:
Error: The following outcomes were not found in `data`: 'sex'.
After all, your workflow contains the line add_formula(sex ~ .), so fit() needs sex to be present in the data it is given.
Packages
library(tidyverse)
library(palmerpenguins)
library(tidymodels)
Preprocess the data and split it into a train/test set and the new data
penguins <- penguins %>%
filter(!is.na(sex)) %>%
select(-year, -island)
penguins_train_test <- penguins[1:233,]
penguins_newdata <- penguins[233:333,-6]
Define the workflow
rf_spec <- rand_forest() %>%
set_mode("classification") %>%
set_engine("ranger")
penguin_wf <- workflow() %>%
add_formula(sex ~ .) %>%
add_model(rf_spec)
Fit the model on the train data using the workflow
penguin_wf %>%
fit(penguins_train_test) -> model
Use the fitted model to predict on the new data
predict(model, penguins_newdata)
Output
# A tibble: 101 x 1
.pred_class
<fct>
1 female
2 male
3 male
4 male
5 female
6 male
7 female
8 male
9 male
10 female
# ... with 91 more rows
Here no tuning is performed, and the model is fit with the default parameters. If you tune the hyperparameters via some form of resampling, which I gather from your question you are able to do, you can extract the best ones from the tuning result based on a specific metric
param_final <- rf_tune_results %>%
select_best(metric = "roc_auc")
and set them in the workflow
rf_workflow <- rf_workflow %>%
finalize_workflow(param_final)
That way, when you fit the model on the train data, the optimal hyperparameters will be used.
Additional details are given in the link I posted in the comment.
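Since `rf_tune_results` and `rf_workflow` are not defined in the snippets above, here is a minimal end-to-end sketch of how they might be produced. Everything specific in it is an assumption made for illustration: tuning only `min_n`, using 200 trees, 5-fold cross-validation, a grid of 5 candidates, and the names `penguins_clean`, `folds`, `final_fit`, and `preds`.

```r
library(tidymodels)
library(palmerpenguins)

penguins_clean <- penguins %>%
  filter(!is.na(sex)) %>%
  select(-year, -island)

penguins_train_test <- penguins_clean[1:233, ]
penguins_newdata <- penguins_clean[233:333, ] %>% select(-sex)

# Model spec with one tunable hyperparameter (min_n picked only as an example)
rf_tune_spec <- rand_forest(min_n = tune(), trees = 200) %>%
  set_mode("classification") %>%
  set_engine("ranger")

rf_workflow <- workflow() %>%
  add_formula(sex ~ .) %>%
  add_model(rf_tune_spec)

# Resamples of the data that contains the outcome
set.seed(123)
folds <- vfold_cv(penguins_train_test, v = 5, strata = sex)

# Evaluate 5 candidate values of min_n on the resamples
rf_tune_results <- tune_grid(
  rf_workflow,
  resamples = folds,
  grid = 5,
  metrics = metric_set(roc_auc)
)

# Pick the best candidate, plug it into the workflow, and fit on the train data
param_final <- select_best(rf_tune_results, metric = "roc_auc")

final_fit <- rf_workflow %>%
  finalize_workflow(param_final) %>%
  fit(data = penguins_train_test)

# Predict on the new data, which has no sex column
preds <- predict(final_fit, new_data = penguins_newdata)
```

The key point is the same as before: the tuned and finalized workflow is still fit on data that has the outcome, and only `predict()` ever sees the new data.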
Upvotes: 4