Reputation: 27
I am currently testing different models to find the one with the best predictive performance, using RMSE as my measure of model effectiveness. I am using the tidymodels
package to work through this, and have tuned the models on regular grids with 5-fold cross-validation and three repeats as the resampling method. The best-performing model was a random forest, followed by a boosted tree. My recipe includes all possible predictors:
all_recipe <- recipe(log_shares ~ ., data = pop_train) %>%
  step_rm(url, timedelta, shares) %>%
  step_normalize() %>%
  step_dummy(all_nominal_predictors())
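In case it is relevant, the resampling and tuning setup looks roughly like this (a simplified sketch, not my exact code; the engine, grid ranges, and seed shown here are placeholders):

library(tidymodels)

set.seed(123)
pop_folds <- vfold_cv(pop_train, v = 5, repeats = 3)

# illustrative random forest spec; my actual tuning parameters differ
rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("regression")

rf_res <- tune_grid(
  workflow(all_recipe, rf_spec),
  resamples = pop_folds,
  grid = grid_regular(mtry(range = c(2, 20)), min_n(), levels = 5),
  metrics = metric_set(rmse)
)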
After fitting to the training set and making predictions on the test set, I plotted the actual vs. predicted values and got the following graph:

[Plot: actual vs. predicted log_shares — the actual values span roughly 1 to 6, while the predictions cluster between about 2.75 and 4.]

My question is: why does this visualization look so off, and what exactly does it mean?
The plot looks the same for other model types (boosted tree, multilayer perceptron) and with another recipe containing fewer predictors that I selected through an EDA.
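For reference, the plot was made along these lines (a sketch; final_rf_fit and pop_test stand in for my fitted workflow and test set):

library(tidymodels)  # loads ggplot2 as well

preds <- augment(final_rf_fit, new_data = pop_test)

ggplot(preds, aes(x = log_shares, y = .pred)) +
  geom_point(alpha = 0.3) +
  geom_abline(linetype = "dashed") +  # y = x reference line
  labs(x = "Actual log(shares)", y = "Predicted log(shares)")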
Upvotes: 1
Views: 757
Reputation: 3185
This shape of plot tells us that your models are not able to predict values far away from the mean of the outcome.
We see that the actual values of the data range from 1 to 6 (on the log scale), but the predictions fall only in the range 2.75 to 4.
This can happen for a couple of reasons. First, it appears you are not actually applying any normalization; you have
all_recipe <- recipe(log_shares ~ ., data = pop_train) %>%
  step_rm(url, timedelta, shares) %>%
  step_normalize() %>%
  step_dummy(all_nominal_predictors())
but what you actually want is something like
all_recipe <- recipe(log_shares ~ ., data = pop_train) %>%
  step_rm(url, timedelta, shares) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())
to specify which variables step_normalize() should be applied to. I'm using all_numeric_predictors() here; you will have to modify it as you see fit.
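If you want to verify which columns a step was actually trained on, you can prep() the recipe and inspect it. A minimal sketch, assuming step_normalize() is the second step as in your recipe:

prepped <- prep(all_recipe, training = pop_train)

# list the columns step_normalize() was trained on
# (number = 2 because it is the second step in the recipe);
# with the selector-less version this should come back empty
tidy(prepped, number = 2)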
Secondly, there might be some information you are not using to its fullest potential. I recommend that you look at the residuals to see if there is anything special about the worst-predicted values:
augment(my_fit, new_data = my_data) |>
  dplyr::arrange(.resid)  # largest negative residuals come first
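Since arrange(.resid) only surfaces one tail, you can also sort by the absolute residual to see the worst predictions in both directions:

augment(my_fit, new_data = my_data) |>
  dplyr::mutate(abs_resid = abs(.resid)) |>
  dplyr::arrange(dplyr::desc(abs_resid))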
Lastly, there is a chance that your data doesn't have enough information, and that this is the best fit you are able to get.
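One way to gauge whether you have hit that ceiling is to resample a null model, which always predicts the mean, and compare its RMSE to your random forest's. A sketch, where pop_folds stands in for your resamples object:

null_spec <- null_model(mode = "regression") |>
  set_engine("parsnip")

null_res <- fit_resamples(
  workflow(all_recipe, null_spec),
  resamples = pop_folds,
  metrics = metric_set(rmse)
)

# if your models barely beat this baseline RMSE,
# the predictors carry little usable signal
collect_metrics(null_res)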
Upvotes: 4