Reputation: 27
I am currently testing different models to find the one with the best predictive performance, using RMSE as my measure of model effectiveness. I am using the tidymodels
package to work through this, and have tuned the models on regular grids with 5-fold cross-validation and three repeats as the resampling method. The best-performing model was a random forest, followed by a boosted tree. My recipe includes all possible predictors:
all_recipe <- recipe(log_shares ~ ., data = pop_train) %>%
  step_rm(url, timedelta, shares) %>%
  step_normalize() %>%
  step_dummy(all_nominal_predictors())
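In case it is relevant, the resampling and tuning setup looks roughly like this (a simplified sketch, not my exact code; the engine, grid ranges, and seed shown here are placeholders):

library(tidymodels)

set.seed(123)
pop_folds <- vfold_cv(pop_train, v = 5, repeats = 3)

# illustrative random forest spec; my actual tuning parameters differ
rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("regression")

rf_res <- tune_grid(
  workflow(all_recipe, rf_spec),
  resamples = pop_folds,
  grid = grid_regular(mtry(range = c(2, 20)), min_n(), levels = 5),
  metrics = metric_set(rmse)
)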
After fitting to the training set and making predictions on the test set, I plotted the actual vs. predicted values and got the following graph:

[Plot: actual vs. predicted log_shares — the actual values span roughly 1 to 6, while the predictions cluster between about 2.75 and 4.]

My question is: why does this visualization look so off, and what exactly does it mean?
The plot looks the same for other model types (boosted tree, multilayer perceptron) and with another recipe containing fewer predictors that I selected through an EDA.
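For reference, the plot was made along these lines (a sketch; final_rf_fit and pop_test stand in for my fitted workflow and test set):

library(tidymodels)  # loads ggplot2 as well

preds <- augment(final_rf_fit, new_data = pop_test)

ggplot(preds, aes(x = log_shares, y = .pred)) +
  geom_point(alpha = 0.3) +
  geom_abline(linetype = "dashed") +  # y = x reference line
  labs(x = "Actual log(shares)", y = "Predicted log(shares)")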
Upvotes: 1
Views: 757
Reputation: 3185
This shape of plot tells us that your models are not able to predict values far away from the mean of the outcome.
We see that the actual values of the data range from 1 to 6 (on the log scale), but the predictions fall only in the range 2.75 to 4.
This can happen for a couple of reasons. First, it appears you are not actually applying any normalization; you have
all_recipe <- recipe(log_shares ~ ., data = pop_train) %>%
  step_rm(url, timedelta, shares) %>%
  step_normalize() %>%
  step_dummy(all_nominal_predictors())
but what you actually want is something like
all_recipe <- recipe(log_shares ~ ., data = pop_train) %>%
  step_rm(url, timedelta, shares) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())
to specify which variables step_normalize() should be applied to. I'm using all_numeric_predictors() here; you will have to modify it as you see fit.
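If you want to verify which columns a step was actually trained on, you can prep() the recipe and inspect it. A minimal sketch, assuming step_normalize() is the second step as in your recipe:

prepped <- prep(all_recipe, training = pop_train)

# list the columns step_normalize() was trained on
# (number = 2 because it is the second step in the recipe);
# with the selector-less version this should come back empty
tidy(prepped, number = 2)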
Secondly, there might be some information you are not using to its fullest potential. I recommend that you look at the residuals to see if there is anything special about the worst-predicted values:
augment(my_fit, new_data = my_data) |>
  dplyr::arrange(.resid)  # largest negative residuals come first
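Since arrange(.resid) only surfaces one tail, you can also sort by the absolute residual to see the worst predictions in both directions:

augment(my_fit, new_data = my_data) |>
  dplyr::mutate(abs_resid = abs(.resid)) |>
  dplyr::arrange(dplyr::desc(abs_resid))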
Lastly, there is a chance that your data doesn't have enough information, and that this is the best fit you are able to get.
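One way to gauge whether you have hit that ceiling is to resample a null model, which always predicts the mean, and compare its RMSE to your random forest's. A sketch, where pop_folds stands in for your resamples object:

null_spec <- null_model(mode = "regression") |>
  set_engine("parsnip")

null_res <- fit_resamples(
  workflow(all_recipe, null_spec),
  resamples = pop_folds,
  metrics = metric_set(rmse)
)

# if your models barely beat this baseline RMSE,
# the predictors carry little usable signal
collect_metrics(null_res)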
Upvotes: 4