Reputation: 305
I'm trying to forecast time series with auto.arima. What I need is to split in train and test data to see the model metrics. my date range is between December 2016 and January 2020. I need the train data to be until December 2018 and the test after this.
Along with this, I need the metrics RMSE and MSE for each id. This is an example of my data:
x<- tibble::tribble(
~ID, ~Date, ~Value,
1L, "01-12-2016", 48L,
1L, "01-01-2017", 10055L,
1L, "01-02-2017", 650L,
1L, "01-03-2017", 8255L,
1L, "01-04-2017", 3680L,
1L, "01-05-2017", 2180L,
1L, "01-06-2017", 2790L,
1L, "01-07-2017", 3805L,
1L, "01-08-2017", 2811L,
1L, "01-09-2017", -225L,
1L, "01-10-2017", -232L,
1L, "01-11-2017", -243L,
1L, "01-12-2017", -217L,
1L, "01-01-2018", -256L,
1L, "01-02-2018", -277L,
1L, "01-03-2018", -3L,
1L, "01-04-2018", -247L,
1L, "01-05-2018", 88L,
1L, "01-06-2018", -260L,
1L, "01-07-2018", -228L,
1L, "01-08-2018", -285L,
1L, "01-09-2018", -321L,
1L, "01-10-2018", -265L,
1L, "01-11-2018", -302L,
1L, "01-12-2018", -11968L,
1L, "01-01-2019", 5435L,
1L, "01-02-2019", 6694L,
1L, "01-03-2019", 4750L,
1L, "01-04-2019", 3747L,
1L, "01-05-2019", 3727L,
1L, "01-06-2019", 3252L,
1L, "01-07-2019", 1691L,
1L, "01-08-2019", 2489L,
1L, "01-09-2019", -182L,
1L, "01-10-2019", 3926L,
1L, "01-11-2019", 326L,
1L, "01-12-2019", -1047L,
1L, "01-01-2020", 14L,
2L, "01-12-2016", -241L,
2L, "01-01-2017", -262L,
2L, "01-02-2017", -231L,
2L, "01-03-2017", -203L,
2L, "01-04-2017", -226L,
2L, "01-05-2017", -223L,
2L, "01-06-2017", -300L,
2L, "01-07-2017", -259L,
2L, "01-08-2017", -241L,
2L, "01-09-2017", -225L,
2L, "01-10-2017", -227L,
2L, "01-11-2017", -243L,
2L, "01-12-2017", -217L,
2L, "01-01-2018", -256L,
2L, "01-02-2018", -277L,
2L, "01-03-2018", 0L,
2L, "01-04-2018", -247L,
2L, "01-05-2018", -274L,
2L, "01-06-2018", -264L,
2L, "01-07-2018", -227L,
2L, "01-08-2018", -275L,
2L, "01-09-2018", -325L,
2L, "01-10-2018", -269L,
2L, "01-11-2018", -306L,
2L, "01-12-2018", -264L,
2L, "01-01-2019", -308L,
2L, "01-02-2019", -332L,
2L, "01-03-2019", -260L,
2L, "01-04-2019", -300L,
2L, "01-05-2019", -302L,
2L, "01-06-2019", -291L,
2L, "01-07-2019", -284L,
2L, "01-08-2019", -288L,
2L, "01-09-2019", -272L,
2L, "01-10-2019", 0L,
2L, "01-11-2019", 0L,
2L, "01-12-2019", -17107L,
2L, "01-01-2020", 3500L,
3L, "01-12-2016", 1940L,
3L, "01-01-2017", 1753L,
3L, "01-02-2017", 2758L,
3L, "01-03-2017", 2539L,
3L, "01-04-2017", -9078L,
3L, "01-05-2017", 5215L,
3L, "01-06-2017", 1796L,
3L, "01-07-2017", -8424L,
3L, "01-08-2017", 19868L,
3L, "01-09-2017", 10707L,
3L, "01-10-2017", 8985L,
3L, "01-11-2017", 3058L,
3L, "01-12-2017", 2469L,
3L, "01-01-2018", 21L,
3L, "01-02-2018", 1039L,
3L, "01-03-2018", 2875L,
3L, "01-04-2018", -2678L,
3L, "01-05-2018", 1515L,
3L, "01-06-2018", 2651L,
3L, "01-07-2018", -5014L,
3L, "01-08-2018", 299L,
3L, "01-09-2018", 1755L,
3L, "01-10-2018", 5009L,
3L, "01-11-2018", 2857L,
3L, "01-12-2018", 2909L,
3L, "01-01-2019", 1353L,
3L, "01-02-2019", 2337L,
3L, "01-03-2019", 3019L,
3L, "01-04-2019", -531L,
3L, "01-05-2019", -1055L,
3L, "01-06-2019", 1706L,
3L, "01-07-2019", -507L,
3L, "01-08-2019", 2234L,
3L, "01-09-2019", 890L,
3L, "01-10-2019", 94L,
3L, "01-11-2019", -1781L,
3L, "01-12-2019", 102590L,
3L, "01-01-2020", 471L,
4L, "01-12-2016", 2658L,
4L, "01-01-2017", 2344L,
4L, "01-02-2017", 2728L,
4L, "01-03-2017", -58L,
4L, "01-04-2017", -226L,
4L, "01-05-2017", -5L,
4L, "01-06-2017", -300L,
4L, "01-07-2017", -259L,
4L, "01-08-2017", -241L,
4L, "01-09-2017", -225L,
4L, "01-10-2017", -229L,
4L, "01-11-2017", -243L,
4L, "01-12-2017", -217L,
4L, "01-01-2018", -245L,
4L, "01-02-2018", -277L,
4L, "01-03-2018", -155L,
4L, "01-04-2018", 5437L,
4L, "01-05-2018", 2866L,
4L, "01-06-2018", 3091L,
4L, "01-07-2018", 3669L,
4L, "01-08-2018", 311L,
4L, "01-09-2018", 4120L,
4L, "01-10-2018", 2357L,
4L, "01-11-2018", -4759L,
4L, "01-12-2018", 4220L,
4L, "01-01-2019", 2730L,
4L, "01-02-2019", 2515L,
4L, "01-03-2019", 2560L,
4L, "01-04-2019", 2864L,
4L, "01-05-2019", 1935L,
4L, "01-06-2019", 938L,
4L, "01-07-2019", 3268L,
4L, "01-08-2019", 3232L,
4L, "01-09-2019", 3347L,
4L, "01-10-2019", 4241L,
4L, "01-11-2019", -247L,
4L, "01-12-2019", 179L,
4L, "01-01-2020", 2542L
)
x<-as.data.frame(x)
I tried with dplyr:
x %>%
group_by(ID) %>%
do(fit=auto.arima(.$Value,seasonal = F,stepwise = F,approximation = F),
fit_forecast=forecast(auto.arima(.$Value,seasonal = F,stepwise = F,approximation = F),h=12))
But I don't know how to add step of train and test and the metrics. Does anyone know how to solve it? Thanks !
Upvotes: 0
Views: 993
Reputation: 31800
I've already answered the same question at https://community.rstudio.com/t/autoarima-train-and-test-grouped-by-id-in-r/66400, but for the record here it is again.
This is much easier to do with the tsibble and fable packages, like this.
library(dplyr)
library(tsibble)
library(lubridate)
library(fable)
# Turn x into a tsibble object
x <- x %>%
mutate(Date = yearmonth(dmy(Date))) %>%
as_tsibble(index = Date, key = ID)
# Use filter to create training set
# Then fit non-seasonal ARIMA models
fit <- x %>%
filter(Date <= yearmonth("2018 Dec")) %>%
model(ARIMA(Value ~ PDQ(0,0,0), stepwise=FALSE, approximation=FALSE))
# Now forecast the test set and compute RMSE and MSE
fit %>%
forecast(h = 13) %>%
accuracy(x) %>%
mutate(MSE = RMSE^2) %>%
select(ID, RMSE, MSE)
#> # A tibble: 4 x 3
#> ID RMSE MSE
#> <int> <dbl> <dbl>
#> 1 1 9696. 94011195.
#> 2 2 4792. 22964718.
#> 3 3 27899. 778326636.
#> 4 4 1776. 3153398.
Created on 2020-05-18 by the reprex package (v0.3.0)
Upvotes: 2
Reputation: 541
You simply subset your dataframe into training and test. Typically 80/20 split or 90/10 split. But in your case, since you seem to want specific dates, you can even hardcode the number of rows.
library("lubridate")
x$Date <- dmy(x$Date)
x <- x[order(x$Date),]
rownames(x) <- NULL
x.train <- x[1:139,]
x.text <- x[-(1:139),]
Typically the data is also sampled to get the 80/20 or 90/10. But in a timeseries, you want to want to simply split instead of sampling.
Upvotes: 0