Alex Stepanov

Reputation: 21

How to predict future values of time series using h2o.predict

I am going through the book "Hands-On Time Series Analysis with R" and I am stuck at the example using the machine learning package h2o. I don't get how to use the h2o.predict function. In the example it requires a newdata argument, which is the test data in this case. But how do you predict future values of a time series if you don't in fact know these values?

If I just ignore the newdata argument I get: "predictions with a missing newdata argument is not implemented yet".

library(h2o)

h2o.init(max_mem_size = "16G")


train_h <- as.h2o(train_df)
test_h <- as.h2o(test_df)
forecast_h <- as.h2o(forecast_df)


x <- c("month", "lag12", "trend", "trend_sqr")
y <- "y"

rf_md <- h2o.randomForest(training_frame = train_h,
                          nfolds = 5,
                          x = x,
                          y = y,
                          ntrees = 500,
                          stopping_rounds = 10,
                          stopping_metric = "RMSE",
                          score_each_iteration = TRUE,
                          stopping_tolerance = 0.0001,
                          seed = 1234)

h2o.varimp_plot(rf_md)

rf_md@model$model_summary

library(plotly)

tree_score <- rf_md@model$scoring_history$training_rmse
plot_ly(x = seq_along(tree_score), y = tree_score,
        type = "scatter", mode = "lines") %>%
  layout(title = "Random Forest Model - Trained Score History",
         yaxis = list(title = "RMSE"),
         xaxis = list(title = "Num. of Trees"))

test_h$pred_rf <- h2o.predict(rf_md, test_h)

test_1 <- as.data.frame(test_h)

mape_rf <- mean(abs(test_1$y - test_1$pred_rf) / test_1$y)
mape_rf

Upvotes: 2

Views: 1811

Answers (2)

Darren Cook

Reputation: 28913

The training data, train_df, has to have all the columns listed in both x (c("month", "lag12", "trend", "trend_sqr")) and y ("y"), whereas the data you give to h2o.predict() only needs the columns in x; the y column is what gets returned as the prediction.

As your features (in x) are things like lag, trend, etc., the fact that it is a time series does not matter. (But you do have to be very careful when preparing those features to make sure you do not use any information in them that was not known at that point in time; I would imagine the book has already been emphasizing that.)

Normally with a time series, for a given row in the training data, your x data is the data known at time t, and the value in the y column is the value of interest at time t+1. So when doing a prediction, you give the x values as the values at the moment, and the prediction returned is what will happen next.
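A minimal sketch of this, assuming forecast_df (defined but never used in the question's code) was built the same way as train_df but covers the future periods and contains only the feature columns, whose values are all computable in advance (calendar month, a 12-step lag taken from known history, a trend index):

```r
library(h2o)
h2o.init()

# forecast_df holds month, lag12, trend, trend_sqr for future periods;
# no "y" column is needed -- that is what the model will predict
forecast_h <- as.h2o(forecast_df)

# h2o.predict() only looks at the x columns of newdata
fc <- h2o.predict(rf_md, newdata = forecast_h)

# attach the forecasts back onto the R data frame
forecast_df$yhat <- as.vector(fc$predict)
```

This is exactly why the book has you build forecast_df: it plays the role of newdata for the future horizon.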

Upvotes: 0

Lauren

Reputation: 5778

H2O-3 does not support traditional time series algorithms (e.g., ARIMA). Instead, the recommendation is to treat the time series use case as a supervised learning problem and perform time-series specific pre-processing.

For example, if your goal was to predict the sales for a store tomorrow, you could treat this as a regression problem where your target would be the Sales. If you try to train a supervised learning model on the raw data, however, chances are your performance would be pretty poor. So the trick is to add historical attributes like lags as a pre-processing step.
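A sketch of that pre-processing step in plain R (dplyr and the toy column names here are assumptions, not from the tutorial), adding the previous day's sales per store as a new feature:

```r
library(dplyr)

# Hypothetical daily sales data: one row per store per date
sales <- data.frame(
  store = rep(c("A", "B"), each = 3),
  date  = rep(as.Date("2024-01-01") + 0:2, times = 2),
  sales = c(100, 110, 105, 200, 190, 210)
)

# Compute the one-day lag within each store's own history,
# so store A's past never leaks into store B's features
sales_lagged <- sales %>%
  arrange(store, date) %>%
  group_by(store) %>%
  mutate(sales_lag1 = lag(sales, 1)) %>%
  ungroup()
```

The first row of each store gets NA for the lag, which you would drop (or impute) before training the regression model.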

If we trained a model on an unaltered dataset, the Mean Absolute Error is around 35%.

If we start adding historical features like the sales from the previous day for that store, we can reduce the Mean Absolute Error to about 15%.

While H2O-3 does not support lagging, you can leverage Sparkling Water to perform this pre-processing. You can use Spark to generate the lags per group and then use H2O-3 to train the regression model. Here is an example of this process: https://github.com/h2oai/h2o-tutorials/tree/master/best-practices/forecasting

Upvotes: 2