Reputation: 51
I am trying to incorporate tidymodels PCA into the workflow of a model. I want a predictive model that uses PCA as a preprocessing step, and then I want to make predictions with that model.
I have tried the following approach:
diamonds <- diamonds %>%
select(-clarity, -cut, - color)
diamonds_split <- initial_split(diamonds, prop = 4/5)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)
diamonds_test <-vfold_cv(diamonds_train)
diamonds_recipe <-
# The basic formula and all the data (outcome ~ predictors)
recipe(price ~ ., data = diamonds_train) %>%
step_log(all_outcomes(),skip = T) %>%
step_normalize(all_predictors(), -all_nominal()) %>%
step_pca(all_predictors())
preprocesados <- prep(diamonds_recipe)
linear_model <-
linear_reg() %>%
set_engine("glmnet") %>%
set_mode("regression")
pca_workflow <- workflow() %>%
add_recipe(diamonds_recipe) %>%
add_model(linear_model)
lr_fitted_workflow <- pca_workflow %>% #option A workflow full dataset
last_fit(diamonds_split)
performance <- lr_fitted_workflow %>% collect_metrics()
test_predictions <- lr_fitted_workflow %>% collect_predictions()
But I get this error:
x Resample1: model (predictions): Error: penalty
should be a single numeric value. ...
Warning message:
“All models failed in [fit_resamples()]. See the .notes
column.”
Following other tutorials, I tried another approach, but I don't know how to use the model to make new predictions, because the new data comes in the original (non-PCA) form. So I tried this:
pca_fit <- juice(preprocesados) %>% #option C no work flow at all
lm(price ~ ., data = .)
prep_test <- prep(diamonds_recipe, new_data = diamonds_test)
truths <- juice(prep_test) %>%
select(price)
ans <- predict(pca_fit, new_data = prep_test)
tib <- tibble(row = 1:length(ans),ans, truths)
ggplot(data = tib) +
geom_smooth(mapping = aes(x = row, y = ans, colour = "predicted")) +
geom_smooth(mapping = aes(x = row, y = price, colour = "true"))
And it prints something that seems reasonable, but by this point I have lost confidence and some guidance would be much appreciated. :D
Upvotes: 1
Views: 525
Reputation: 8572
The problem is not in your recipe or the workflow. As described in chapter 7 of Tidy Modeling with R (TMwR), the function for fitting your model is fit, and for it to work you have to provide the data for the fitting process (here diamonds_train). In return, you don't have to prep your recipe, as the workflow takes care of this itself.
So reducing your code slightly, the example below will work.
library(tidymodels)
data(diamonds)
diamonds <- diamonds %>%
select(-clarity, -cut, - color)
diamonds_split <- initial_split(diamonds, prop = 4/5)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)
diamonds_recipe <-
# The basic formula and all the data (outcome ~ predictors)
recipe(price ~ ., data = diamonds_train) %>%
step_log(all_outcomes(),skip = T) %>%
step_normalize(all_predictors(), -all_nominal()) %>%
step_pca(all_predictors())
linear_model <-
linear_reg() %>%
set_engine("glmnet") %>%
set_mode("regression")
pca_workflow <- workflow() %>%
add_recipe(diamonds_recipe) %>%
add_model(linear_model)
pca_fit <- fit(pca_workflow, data = diamonds_train)
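Since the question was how to predict on new data that arrives in its original (non-PCA) form, here is a minimal sketch of that step. A fitted workflow applies the recipe to new_data itself when predict is called, so no manual prep or bake is needed. The sketch assumes the "lm" engine (my substitution, to keep it independent of any engine-specific settings), and the names wf_fit and preds are only illustrative. Because step_log has skip = TRUE, the predictions come back on the log scale and are transformed back with exp.

```r
library(tidymodels)

data(diamonds, package = "ggplot2")
set.seed(123)

diamonds_num <- diamonds %>%
  select(-clarity, -cut, -color)

diamonds_split <- initial_split(diamonds_num, prop = 4/5)
diamonds_train <- training(diamonds_split)
diamonds_test  <- testing(diamonds_split)

diamonds_recipe <- recipe(price ~ ., data = diamonds_train) %>%
  step_log(all_outcomes(), skip = TRUE) %>%  # applied when fitting, skipped for new data
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors())

# Assumption: "lm" engine instead of "glmnet", purely for illustration.
lm_model <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

wf_fit <- workflow() %>%
  add_recipe(diamonds_recipe) %>%
  add_model(lm_model) %>%
  fit(data = diamonds_train)

# Raw test data goes in; the workflow normalizes and projects it onto the PCs.
preds <- predict(wf_fit, new_data = diamonds_test) %>%
  mutate(.pred_price = exp(.pred)) %>%       # back-transform from log(price)
  bind_cols(diamonds_test %>% select(price))

head(preds)
```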
As for cross-validation, one has to use fit_resamples and should split the training set, not the testing set. But here I am currently getting the same error (my answer will be updated if I figure out why).
Now I've done a bit of digging, and the problem with cross-validation stems from the engine being glmnet. I am guessing that, of the many different aspects, this one has somehow been missed. I've opened a possible issue on the workflows package GitHub site. Answers there often come quickly, so one of the developers will likely reply soon.
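One reading of the error message ("penalty should be a single numeric value") that may be worth trying in the meantime: a glmnet fit covers a whole path of penalty values, and parsnip needs a single one to make predictions. The sketch below passes a penalty to linear_reg and reruns the resampling. The value 0.01 is an arbitrary illustration, not a tuned choice, and the object names are my own. I also log price up front rather than with step_log(skip = TRUE), which is my assumption so that the resampling metrics compare predictions and truth on the same scale.

```r
library(tidymodels)  # the "glmnet" engine also needs the glmnet package installed

data(diamonds, package = "ggplot2")
set.seed(2021)

# Assumption: log the outcome before splitting instead of inside the recipe.
diamonds_num <- diamonds %>%
  select(-clarity, -cut, -color) %>%
  mutate(price = log(price))

diamonds_split <- initial_split(diamonds_num, prop = 4/5)
diamonds_train <- training(diamonds_split)

pca_recipe <- recipe(price ~ ., data = diamonds_train) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors())

# penalty = 0.01 is an arbitrary illustrative value.
glmnet_model <- linear_reg(penalty = 0.01) %>%
  set_engine("glmnet") %>%
  set_mode("regression")

glmnet_workflow <- workflow() %>%
  add_recipe(pca_recipe) %>%
  add_model(glmnet_model)

folds <- vfold_cv(diamonds_train, v = 5)
glmnet_cv <- fit_resamples(glmnet_workflow, resamples = folds)
glmnet_metrics <- collect_metrics(glmnet_cv)
glmnet_metrics
```

If the penalty should be chosen by the data instead, it can be marked with tune() and searched with tune_grid, which is covered in the tuning chapters of TMwR.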
As for cross-validation, assume you instead fit using any of the other engines described in ?linear_reg; then we could do this as:
linear_model_base <-
linear_reg() %>%
set_engine("lm") %>%
set_mode("regression")
pca_workflow <- update_model(pca_workflow, linear_model_base)
folds <- vfold_cv(diamonds_train, 10)
pca_folds_fit <- fit_resamples(pca_workflow, resamples = folds)
and, where metrics are of interest, these can indeed be collected as you did, using collect_metrics:
pca_folds_fit %>% collect_metrics()
If you are interested in the predictions, you have to tell fit_resamples to save them during the fitting process and then use collect_predictions:
pca_folds_fit <- fit_resamples(pca_workflow, resamples = folds, control = control_resamples(save_pred = TRUE))
collect_predictions(pca_folds_fit)
Note, however, that the output from this is the predictions from each fold, as you are literally fitting 10 models.
Usually cross-validation is used to compare multiple models or tuning parameters (e.g. random forest vs. linear model). The model with the best cross-validation performance (from collect_metrics) would then be selected for use, and the test dataset would be used to evaluate this model's accuracy.
This is all described in TMwR chapters 10 and 11.
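A rough sketch of that compare-then-evaluate loop, under my own assumptions: the two candidates are a linear model with and without the PCA step (chosen only for illustration), num_comp = 3 is arbitrary, price is logged up front so that all metrics are on one scale, and the names candidates, cv_rmse and final_fit are hypothetical.

```r
library(tidymodels)

data(diamonds, package = "ggplot2")
set.seed(2021)

# Assumption: log the outcome before splitting so CV and test metrics agree in scale.
diamonds_num <- diamonds %>%
  select(-clarity, -cut, -color) %>%
  mutate(price = log(price))

diamonds_split <- initial_split(diamonds_num, prop = 4/5)
diamonds_train <- training(diamonds_split)
folds <- vfold_cv(diamonds_train, v = 5)

lm_model <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

base_recipe <- recipe(price ~ ., data = diamonds_train) %>%
  step_normalize(all_predictors())
pca_recipe <- base_recipe %>%
  step_pca(all_predictors(), num_comp = 3)  # 3 components: arbitrary choice

candidates <- list(
  no_pca = workflow() %>% add_recipe(base_recipe) %>% add_model(lm_model),
  pca    = workflow() %>% add_recipe(pca_recipe) %>% add_model(lm_model)
)

# Mean cross-validated RMSE for each candidate workflow.
cv_rmse <- sapply(candidates, function(wf) {
  fit_resamples(wf, resamples = folds) %>%
    collect_metrics() %>%
    filter(.metric == "rmse") %>%
    pull(mean)
})
cv_rmse

# Refit the best candidate on the full training set and score it once on the test set.
best_workflow <- candidates[[which.min(cv_rmse)]]
final_fit <- last_fit(best_workflow, diamonds_split)
final_metrics <- collect_metrics(final_fit)
final_metrics
```

The test set is touched only once, by last_fit, after the choice between candidates has been made on the folds.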
Upvotes: 1