psysky

Reputation: 3195

R::forecast Two-factor forecast

I need to produce a forecast by product and by mall. Here is a small part of my dataset:

date        mall    product price
01.01.2017  mall1   prod1   94
01.01.2017  mall1   prod1   65
01.01.2017  mall1   prod1   50
01.01.2017  mall1   prod1   92
01.01.2017  mall1   prod2   97
01.01.2017  mall1   prod2   80
01.01.2017  mall1   prod2   51
01.01.2017  mall1   prod2   90
01.01.2017  mall1   prod3   52
01.01.2017  mall1   prod3   73
01.01.2017  mall1   prod3   59
01.01.2017  mall1   prod3   85
01.01.2017  mall2   prod1   56
01.01.2017  mall2   prod1   60
01.01.2017  mall2   prod1   89
01.01.2017  mall2   prod1   87
01.01.2017  mall2   prod2   77
01.01.2017  mall2   prod2   79
01.01.2017  mall2   prod2   99
01.01.2017  mall2   prod2   59
01.01.2017  mall2   prod3   98
01.01.2017  mall2   prod3   50
01.01.2017  mall2   prod3   54
01.01.2017  mall2   prod3   98
02.01.2017  mall1   prod1   60
02.01.2017  mall1   prod1   68
02.01.2017  mall1   prod1   65
02.01.2017  mall1   prod1   81
02.01.2017  mall1   prod2   74
02.01.2017  mall1   prod2   63
02.01.2017  mall1   prod2   88
02.01.2017  mall1   prod2   71
02.01.2017  mall1   prod3   67
02.01.2017  mall1   prod3   73
02.01.2017  mall1   prod3   62
02.01.2017  mall1   prod3   57
02.01.2017  mall2   prod1   51
02.01.2017  mall2   prod1   65
02.01.2017  mall2   prod1   100
02.01.2017  mall2   prod1   67
02.01.2017  mall2   prod2   74
02.01.2017  mall2   prod2   70
02.01.2017  mall2   prod2   60
02.01.2017  mall2   prod2   97
02.01.2017  mall2   prod3   90
02.01.2017  mall2   prod3   100
02.01.2017  mall2   prod3   72
02.01.2017  mall2   prod3   50

For each product in each mall, I need to forecast two days in advance. While searching for an R library I found this forum and the forecast package, with its ets function. How can I write a loop or function that produces a forecast for each product of each mall? Ideally, the output would look like this:

date        mall    product price
03.01.2017  mall1   prod1   pred.value
03.01.2017  mall1   prod2   pred.value
03.01.2017  mall1   prod3   pred.value
03.01.2017  mall1   prod4   pred.value
03.01.2017  mall2   prod1   pred.value
03.01.2017  mall2   prod2   pred.value
03.01.2017  mall2   prod3   pred.value
03.01.2017  mall2   prod4   pred.value
04.01.2017  mall1   prod1   pred.value
04.01.2017  mall1   prod2   pred.value
04.01.2017  mall1   prod3   pred.value
04.01.2017  mall1   prod4   pred.value
04.01.2017  mall2   prod1   pred.value
04.01.2017  mall2   prod2   pred.value
04.01.2017  mall2   prod3   pred.value
04.01.2017  mall2   prod4   pred.value

Any help is valuable.

Upvotes: 0

Views: 432

Answers (1)

Stéphane

Reputation: 207

Essentially, you are forecasting (number of products) x (number of malls) variables, two days in advance. All the data you have is the price of each product, in each mall, every day.

The first thing you need to do is specify a set of forecasting models that you will compare in some way to decide how you will produce forecasts. You can use ARIMA-type models, or non-parametric methods such as support vector regression, to relate current prices to past prices.

Let's say you want to use ARIMA-type models and want to compare, say, ARMA(1,1) to an AR(2). The idea is to hold out a fraction of your dataset at the end; say, you keep the last 20%. You take the first 80% minus the last two days, estimate an AR(2) and an ARMA(1,1) on that data, and use them to forecast the first day of the 20% you left out. Then you move the end of your window forward by one day. If you want to keep the estimation window at a constant number of data points, you also discard the first observation. You re-estimate all models and produce the second forecast, and so on until you have produced all those forecasts, for all your models.

Then, since you know what values were realized, you can compute 2-day-ahead forecast errors for every single model over the last 20% of your dataset. You can measure the mean squared error, the mean absolute error, the percentage of correct sign predictions, the percentage of errors falling in an interval around the forecasted value, and various other statistical measures of out-of-sample performance based on those errors. Every such statistic will help you rank the models -- if you have many statistics, you can visualize how the models perform with a spider chart, if you like.
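As a small sketch of how those statistics come out of an error vector -- with simulated numbers standing in for real forecasts, so all names here are illustrative:

```r
# Illustrative only: 'actual' are realized values, 'fcst' the 2-day-ahead
# forecasts over the evaluation window, 'err' the resulting errors.
set.seed(42)
actual <- rnorm(100)
fcst   <- actual + rnorm(100, sd = 0.5)
err    <- actual - fcst

mse      <- mean(err^2)                       # mean squared error
mae      <- mean(abs(err))                    # mean absolute error
sign_hit <- mean(sign(fcst) == sign(actual))  # share of correct signs
band_hit <- mean(abs(err) <= 1)               # share of errors within +/- 1
```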

Now, how do you code that? I simulate data and provide the seed so you can see how each part works. Basically, you pick a subsample and, for each model, you estimate, forecast and collect errors over that subsample. If you want to make things more elaborate, you can add another layer to the loop to go through many AR(p) and ARMA(p,q) models, collect, say, BIC values, and produce the forecast from the minimal-BIC model. You can also code a least squares estimate of the AR model and, instead of producing an iterative forecast ('forecast' uses the structure of the ARIMA model to generate a forecast through a recursive equation), produce a direct forecast. Direct forecasting means you begin the lags at the horizon of the forecast -- here, you would have y_{t+2} = constant + phi_1 y_t + ... + phi_p y_{t-p+1} + e_{t+2}, so you skip y_{t+1}.
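A minimal sketch of such a direct 2-step-ahead AR forecast via least squares, on simulated data (the variable names are illustrative, not from any package):

```r
# Direct 2-step-ahead forecast from an AR(2)-style regression:
# regress y[t] on y[t-2] and y[t-3], i.e. lags starting at the horizon.
set.seed(1030)
y <- as.numeric(arima.sim(model = list(ar = 0.8), n = 500))

h <- 2                              # forecast horizon
n <- length(y)
t_idx <- (h + 2):n                  # rows where both lags exist
fit <- lm(y[t_idx] ~ y[t_idx - h] + y[t_idx - h - 1])

# The forecast of y[n + 2] uses only values observed up to time n:
direct_fc <- sum(coef(fit) * c(1, y[n], y[n - 1]))
```

Note that nothing at time n + 1 is needed: the fitted lags jump straight from the forecast date back to the last observed values.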

Direct forecasts for AR models tend to perform slightly better. As for ARMA, I would not advise going beyond p,q = 1 for forecasting. ARMA(1,1) is a first-order approximation to both infinite MAs and infinite ARs, so it does capture complicated (but linear) responses. Obviously, you can use packages like 'e1071' and train support vector machines, if you want. It comes with a tune function to adjust hyperparameters and kernel parameters, as well as subsampling and predict functions to make choices and produce forecasts -- and, code-wise, it's not more complicated than what you see below.

And, in case you did not think about it: once you have a few forecasting models, you can use the mean of forecasts, the median of forecasts or an optimized convex combination of forecasts as a forecasting model itself -- that tends to be the best, and it's not harder or longer once you have a few models to compare.
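For instance, given two forecast vectors (purely illustrative numbers), the equal-weight combinations are one-liners:

```r
f_ar   <- c(101, 98, 105)                    # forecasts from model 1
f_arma <- c(99, 102, 103)                    # forecasts from model 2

comb_mean   <- rowMeans(cbind(f_ar, f_arma))          # equal-weight mean
comb_median <- apply(cbind(f_ar, f_arma), 1, median)  # pointwise median
```

An optimized combination would instead choose the weights to minimize, say, the MSE of the combined forecast over the evaluation window.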

library(forecast)

set.seed(1030)
e <- rnorm(n=1000, sd=1, mean=0)  # Create errors for simulation
y <- array(data=0, dim=c(1000,1)) # Create vector to hold values
phi <- 0.8

# Simulate an AR(1) process
for (i in 2:length(y)){
  y[i,1] <- phi*y[i-1,1] + e[i]
}

# Now, we'll use only the last half of the sample. It doesn't matter that
# we started at 0 because an AR(1) process with abs(phi) < 1 is ergodic and
# stationary.
y <- y[501:1000,1]

# Now we have data, we can estimate a model and produce an out-of-sample
# exercise:
poos <- 250:length(y)               # Out-of-sample window: the last half
forecast_ar <- rep(NA, length(y))   # Indexed like y; filled only over poos
forecast_arma <- forecast_ar
error <- forecast_ar
error_arma <- error

for (i in poos){
  # AR model
  a <- Arima(y = y[1:(i-2)],          # Horizon = 2 periods
             order = c(1,0,0),
             seasonal = c(0,0,0),
             include.constant = TRUE) # We estimate an AR(1) model
  forecast_ar[i] <- forecast(a, h=2)$mean[2]
  error[i] <- y[i] - forecast_ar[i]

  # ARMA model
  a <- Arima(y = y[1:(i-2)],          # Horizon = 2 periods
             order = c(1,0,1),
             seasonal = c(0,0,0),
             include.constant = TRUE) # We estimate an ARMA(1,1) model
  forecast_arma[i] <- forecast(a, h=2)$mean[2]
  error_arma[i] <- y[i] - forecast_arma[i]
}
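Finally, coming back to the question as asked -- one forecast per product per mall, two days ahead -- here is a hedged sketch using split() and ets(). The data frame name df, the simulated values, and the choice to average multiple same-day prices into one daily observation are all assumptions on my part:

```r
library(forecast)

# Simulated stand-in for the question's data: 10 days, 2 malls, 1 product.
set.seed(1)
df <- data.frame(
  date    = rep(seq(as.Date("2017-01-01"), by = "day", length.out = 10), 2),
  mall    = rep(c("mall1", "mall2"), each = 10),
  product = "prod1",
  price   = round(runif(20, 50, 100))
)

# One ets() model per (mall, product) group, forecast h = 2 days ahead.
results <- do.call(rbind, lapply(
  split(df, list(df$mall, df$product), drop = TRUE),
  function(g) {
    daily <- aggregate(price ~ date, data = g, FUN = mean)  # one obs per day
    fc    <- forecast(ets(ts(daily$price)), h = 2)
    data.frame(date    = max(daily$date) + 1:2,
               mall    = g$mall[1],
               product = g$product[1],
               price   = as.numeric(fc$mean))
  }))
row.names(results) <- NULL
```

This produces one row per group per forecast day, in the long format shown in the question. With only two days of real data per group, though, ets() has almost nothing to fit -- any of the model-selection machinery above only becomes meaningful once the series are longer.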

Upvotes: 2
