Anakin Skywalker

Reputation: 2520

Evaluating Prophet model in R, using cross-validation

I am trying to cross-validate a Prophet model in R. The problem is that this package does not work well with monthly data.

I managed to build the model and even used a custom monthly seasonality, as recommended by the authors of this tool.

But I cannot cross-validate monthly data. I tried to follow the recommendations in the GitHub issue, but I am missing something.

Currently my code looks like this

model1_cv <- cross_validation(model1, initial = 156, period = 365/12, as.difftime(horizon = 365/12, units = "days"))

Updated:

Based on the answer to this question, I visualized the CV results. There are some problems here. I used the full data and partial data.

Also, the metrics do not look that good.

Upvotes: 0

Views: 1379

Answers (1)

DPH

Reputation: 4344

I tested a bit with training data from the package. From what I understand, the package is not really well suited for monthly forecasts. This part, [...] as.difftime(365/12, units = "days") [...], seems to have been given just to approximate the length of a month (30-something days). That means you can use it instead of the plain 365/12 for "period" and/or "horizon". One thing I noticed is that both arguments are described as integers, but inside the function they are converted with as.difftime(), so they are actually doubles.

library(dplyr)
library(prophet)
library(data.table)

#training data
df <- data.table::fread("ds     y
              1992-01-01    146376
              1992-02-01    147079
              1992-03-01    159336
              1992-04-01    163669
              1992-05-01    170068
              1992-06-01    168663
              1992-07-01    169890
              1992-08-01    170364
              1992-09-01    164617
              1992-10-01    173655
              1992-11-01    171547
              1992-12-01    208838
              1993-01-01    153221
              1993-02-01    150087
              1993-03-01    170439
              1993-04-01    176456
              1993-05-01    182231
              1993-06-01    181535
              1993-07-01    183682
              1993-08-01    183318
              1993-09-01    177406
              1993-10-01    182737
              1993-11-01    187443
              1993-12-01    224540
              1994-01-01    161349
              1994-02-01    162841
              1994-03-01    192319
              1994-04-01    189569
              1994-05-01    194927
              1994-06-01    197946
              1994-07-01    193355
              1994-08-01    202388
              1994-09-01    193954
              1994-10-01    197956
              1994-11-01    202520
              1994-12-01    241111
              1995-01-01    175344
              1995-02-01    172138
              1995-03-01    201279
              1995-04-01    196039
              1995-05-01    210478
              1995-06-01    211844
              1995-07-01    203411
              1995-08-01    214248
              1995-09-01    202122
              1995-10-01    204044
              1995-11-01    212190
              1995-12-01    247491
              1996-01-01    185019
              1996-02-01    192380
              1996-03-01    212110
              1996-04-01    211718
              1996-05-01    226936
              1996-06-01    217511
              1996-07-01    218111")

df <- df %>% 
  dplyr::mutate(ds = as.Date(ds))

model <- prophet::prophet(df)

(tscv.myfit <- prophet::cross_validation(model, horizon = 365/12, units = "days", period = 365/12, initial = 365/12 * 12 * 3))

         y         ds     yhat yhat_lower yhat_upper              cutoff
 1: 175344 1995-01-01 170988.8   170145.9   171828.0 1994-12-31 02:00:00
 2: 172138 1995-02-01 178117.4   176975.2   179070.2 1995-01-30 12:00:00
 3: 201279 1995-03-01 211462.8   210277.4   212670.8 1995-01-30 12:00:00
 4: 196039 1995-04-01 200113.9   198079.5   201977.8 1995-03-01 22:00:00
 5: 210478 1995-05-01 202100.5   200390.8   203797.9 1995-04-01 08:00:00
 6: 211844 1995-06-01 208330.5   206229.9   210497.4 1995-05-01 18:00:00
 7: 203411 1995-07-01 202563.8   200786.5   204313.0 1995-06-01 04:00:00
 8: 214248 1995-08-01 214639.6   212748.3   216461.3 1995-07-01 14:00:00
 9: 202122 1995-09-01 204954.0   203048.9   206768.4 1995-08-31 12:00:00
10: 204044 1995-10-01 205097.5   203209.7   206882.3 1995-09-30 22:00:00
11: 212190 1995-11-01 213586.7   211728.1   215617.6 1995-10-31 08:00:00
12: 247491 1995-12-01 251518.8   249708.2   253589.2 1995-11-30 18:00:00
13: 185019 1996-01-01 182403.7   180520.1   184494.7 1995-12-31 04:00:00
14: 192380 1996-02-01 184722.9   182772.7   186686.9 1996-01-30 14:00:00
15: 212110 1996-03-01 205020.1   202823.2   206996.9 1996-01-30 14:00:00
16: 211718 1996-04-01 214514.0   211891.9   217175.3 1996-03-31 14:00:00
17: 226936 1996-05-01 218845.2   216133.8   221420.4 1996-03-31 14:00:00
18: 217511 1996-06-01 218672.2   216007.8   221459.9 1996-05-31 14:00:00
19: 218111 1996-07-01 221156.1   218540.7   224184.1 1996-05-31 14:00:00

The cutoffs are not as regular as one would expect. I guess this is due to using the average number of days per month, though I could not figure out the logic. You can replace 365/12 with as.difftime(365/12, units = "days") and you will get the same result.

But if you use (365+365+365+366) / 48 instead, to account for 29 February, you get a slightly different average number of days per month, and this leads to a different output:

(tscv.myfit_2 <- prophet::cross_validation(model, horizon = (365+365+365+366)/48, units = "days", period = (365+365+365+366)/48, initial = (365+365+365+366)/48 * 12 * 3))

         y         ds     yhat yhat_lower yhat_upper              cutoff
 1: 172138 1995-02-01 178117.4   177075.3   179203.9 1995-01-29 13:30:00
 2: 201279 1995-03-01 211462.8   210340.5   212607.3 1995-01-29 13:30:00
 3: 196039 1995-04-01 200113.9   198022.6   202068.1 1995-03-31 13:30:00
 4: 210478 1995-05-01 204100.2   202009.8   206098.7 1995-03-31 13:30:00
 5: 211844 1995-06-01 208330.5   206114.5   210515.8 1995-05-31 13:30:00
 6: 203411 1995-07-01 202606.0   200319.1   204663.4 1995-05-31 13:30:00
 7: 214248 1995-08-01 214639.6   212684.4   216495.7 1995-07-31 22:30:00
 8: 202122 1995-09-01 204954.0   203127.7   206951.0 1995-08-31 09:00:00
 9: 204044 1995-10-01 205097.5   203285.3   207036.5 1995-09-30 19:30:00
10: 212190 1995-11-01 213586.7   211516.8   215516.2 1995-10-31 06:00:00
11: 247491 1995-12-01 251518.8   249658.3   253590.1 1995-11-30 16:30:00
12: 185019 1996-01-01 182403.7   180359.7   184399.2 1995-12-31 03:00:00
13: 192380 1996-02-01 184722.9   182652.4   186899.8 1996-01-30 13:30:00
14: 212110 1996-03-01 205020.1   203040.3   207171.9 1996-01-30 13:30:00
15: 211718 1996-04-01 214514.0   211942.6   217252.6 1996-03-31 13:30:00
16: 226936 1996-05-01 218845.2   216203.1   221506.5 1996-03-31 13:30:00
17: 217511 1996-06-01 218672.2   215823.9   221292.4 1996-05-31 13:30:00
18: 218111 1996-07-01 221156.1   218236.7   223862.0 1996-05-31 13:30:00
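
As a rough check on where the odd times of day in the cutoff column come from (plain base R date arithmetic, not a reconstruction of the package's internal cutoff logic): 365/12 days is 30 days and 10 hours, and (365+365+365+366)/48 days is 30 days and 10.5 hours. Stepping back by that amount from the last observation, 1996-07-01 at midnight, lands on the times shown in the last rows of the two outputs. The names last_obs, step_a and step_b are made up for this sketch, and UTC is assumed.

```r
# Last observation in the training data, taken as midnight UTC (assumption).
last_obs <- as.POSIXct("1996-07-01", tz = "UTC")

# The two "average month" lengths used for horizon/period above.
step_a <- as.difftime(365 / 12, units = "days")
step_b <- as.difftime((365 + 365 + 365 + 366) / 48, units = "days")

# Both are fractional numbers of days, i.e. whole days plus some hours.
as.numeric(step_a, units = "hours")  # 730   hours = 30 days + 10 hours
as.numeric(step_b, units = "hours")  # 730.5 hours = 30 days + 10.5 hours

# One step back from the last observation no longer falls on midnight.
cutoff_a <- last_obs - step_a
cutoff_b <- last_obs - step_b

format(cutoff_a, "%Y-%m-%d %H:%M:%S", tz = "UTC")  # "1996-05-31 14:00:00"
format(cutoff_b, "%Y-%m-%d %H:%M:%S", tz = "UTC")  # "1996-05-31 13:30:00"
```

So any cutoff derived from these step sizes drifts away from the first of the month, which fits the irregular cutoffs seen above.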

From this behaviour I would say the workaround is not ideal, especially depending on how exact you want the cross-validation to be in terms of rolling months. If you need the cutoff points to be exact, you could write your own function that always predicts one month ahead from the starting point, collect these results, and build the final comparison. I would trust this approach more than the workaround.
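
A minimal sketch of such a hand-written rolling evaluation, assuming the data has exactly one row per month with no gaps and columns ds and y. rolling_monthly_cv and prophet_one_step are hypothetical names introduced here, not functions from the prophet package. The model step is passed in as a function, so the loop itself does not depend on prophet.

```r
# Hypothetical helper: fit prophet on `train` and forecast a single date.
# Assumes `train` has columns ds (Date) and y, like the training data above.
prophet_one_step <- function(train, next_ds) {
  m <- prophet::prophet(train)
  fc <- predict(m, data.frame(ds = next_ds))
  fc$yhat
}

# Hypothetical helper: expanding-window, one-month-ahead evaluation.
# The cutoff is always the last observed month in the training window,
# so cutoffs fall exactly on dates that exist in the data.
rolling_monthly_cv <- function(df, initial_months, fit_predict) {
  df <- as.data.frame(df)
  df <- df[order(df$ds), ]
  n <- nrow(df)
  stopifnot(initial_months >= 1, initial_months < n)

  rows <- lapply(seq(initial_months, n - 1), function(i) {
    train <- df[seq_len(i), ]
    data.frame(
      cutoff = df$ds[i],      # last month used for training
      ds     = df$ds[i + 1],  # month being predicted
      y      = df$y[i + 1],   # observed value
      yhat   = fit_predict(train, df$ds[i + 1])
    )
  })
  do.call(rbind, rows)
}
```

With the training data above, a call such as rolling_monthly_cv(df, initial_months = 36, fit_predict = prophet_one_step) would refit the model once per month, which is slow but keeps every cutoff on a month boundary. Error metrics can then be computed directly from the y and yhat columns.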

Upvotes: 1
