datawookie

Reputation: 6524

Large uncertainties for holidays in Prophet model

I'm building a time series model with Prophet and getting some weird behaviour with the uncertainties around holidays which I don't understand.

The data are from Google Trends and relate to searches for the term "flowers".

library(dplyr)
library(gtrendsR)
library(prophet)

flowers <- gtrends("flowers")$interest_over_time

flowers <- flowers %>% select(ds = date, y = hits)

As you might expect, this time series has peaks around two important days: Valentine's Day and Mothers' Day.

To take these days into account in my model I created a dataframe with the relevant dates for the period of interest.

holidays <- rbind(
  data.frame(
    holiday = "mothers_day",
    ds = as.Date(c(
      # Second Sunday of May.
      '2014-05-11',
      '2015-05-10',
      '2016-05-08',
      '2017-05-14',
      '2018-05-13',
      '2019-05-12',
      '2020-05-10'
    )),
    lower_window = -7,       # Extend holiday to 7 days before nominal date
    upper_window = +7,       # Extend holiday to 7 days after nominal date
    prior_scale = 1
  ),
  data.frame(
    holiday = "valentines_day",
    ds = as.Date(c(
      '2014-02-14',
      '2015-02-14',
      '2016-02-14',
      '2017-02-14',
      '2018-02-14',
      '2019-02-14',
      '2020-02-14'
    )),
    lower_window = -7,       # Extend holiday to 7 days before nominal date
    upper_window = +7,       # Extend holiday to 7 days after nominal date
    prior_scale = 1
  )
)

Since the time series data are at weekly intervals, I used the lower_window and upper_window to extend the effect of the holidays on either side of the nominal date.

Now fit a model using those holidays.

flowers_prophet <- prophet(
  holidays = holidays,
  mcmc.samples = 300
)

flowers_prophet <- fit.prophet(
  flowers_prophet,
  flowers
)

With the model in hand we can make predictions.

flowers_future <- make_future_dataframe(flowers_prophet,
                                        periods = 52,
                                        freq = 'week')

flowers_forecast <- predict(flowers_prophet, flowers_future)

prophet_plot_components(flowers_prophet, flowers_forecast)

And this is where things get weird.

[Figure: Components of time series predictions]

The trend and the annual variation look perfectly reasonable. The variations associated with the historical holidays look good too. Mothers' Day 2020 looks fine. However, Valentine's Day 2020 has a small predicted value (relative to historical values) and extremely large uncertainties.

The actual time series looks good: historical values are fit well and the prediction for Mothers' Day 2020 looks eminently reasonable. But the value and uncertainties for Valentine's Day 2020 just don't look right.

[Figure: Time series prediction]

If anybody can help me understand why the predictions for these two holidays are so different I'd be extremely grateful.

Upvotes: 2

Views: 1016

Answers (1)

Jon Spring

Reputation: 66415

Valentine's Day always falls on the 14th, but the Google Trends data are sampled every 7 days, so the holiday is misaligned with the weekly observations in the historical data. In 2016, the peak was in the week labelled "2016-02-07", a whole week before the holiday, while the next year the peak week was labelled "2017-02-12", only 2 days before.

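library(dplyr)
library(lubridate)

# `flowers` here is the raw gtrends("flowers")$interest_over_time data,
# before the columns were renamed to ds and y.
# Find the week with the most hits in each February.
flowers %>%
  filter(month(date) == 2) %>%
  group_by(yr = year(date)) %>%
  arrange(-hits) %>%
  slice(1)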

# A tibble: 5 x 7
# Groups:   yr [5]
  date                 hits keyword geo   gprop category    yr
  <dttm>              <int> <chr>   <chr> <chr>    <int> <dbl>
1 2015-02-08 00:00:00    87 flowers world web          0  2015
2 2016-02-07 00:00:00    79 flowers world web          0  2016
3 2017-02-12 00:00:00    88 flowers world web          0  2017
4 2018-02-11 00:00:00    91 flowers world web          0  2018
5 2019-02-10 00:00:00    89 flowers world web          0  2019

I suspect the problem is that Prophet sometimes sees the 14th as being near the peak and sometimes as a whole week after it. It sees a spike, but the timing of that spike isn't consistently aligned with the holiday date you specified. I'm not quite sure how to get around that without manually removing the temporal inconsistency.
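The misalignment can be made concrete by computing, for each year, how far the holiday falls from the Sunday that stamps its week. This is only a minimal base-R sketch: it assumes the weekly rows are dated on Sundays, as in the output above, and the names `offset_to_sunday`, `valentines`, `mothers` and `offsets` are made up for illustration.

```r
# Hypothetical helper: days from a date back to the Sunday on or before it
# (0 if the date is itself a Sunday). "%w" gives 0 = Sunday ... 6 = Saturday.
offset_to_sunday <- function(d) -as.integer(format(d, "%w"))

# Holiday dates taken from the question's holidays data frame.
valentines <- as.Date(paste0(2014:2020, "-02-14"))
mothers <- as.Date(c("2014-05-11", "2015-05-10", "2016-05-08", "2017-05-14",
                     "2018-05-13", "2019-05-12", "2020-05-10"))

# Assumption: weekly observations are stamped on Sundays.
offsets <- data.frame(
  year       = 2014:2020,
  valentines = offset_to_sunday(valentines),
  mothers    = offset_to_sunday(mothers)
)
print(offsets)
```

The Valentine's Day offset changes from year to year, and the 2020 value (-5) does not occur in any of 2015 to 2019, whereas Mothers' Day always lands exactly on a weekly observation. That may be part of why the two holidays behave so differently in the forecast.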

If we shift the holidays to align to the dates they correspond to in the data, we get a better fit:

...  # use this list of Valentine's Day dates, corresponding to peaks in the data
    holiday = "valentines_day",
    ds = as.Date(c(
      '2015-02-08',
      '2016-02-07',
      '2017-02-12',
      '2018-02-11',
      '2019-02-10',
      '2020-02-09'  # Corresponds to the Sunday beforehand, like prior spikes here
    ))
...

Resulting in:

[Figure: trend chart]
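The shifted dates listed above need not be typed by hand. As a sketch, assuming the weekly data are stamped on Sundays and that the spike belongs to the Sunday strictly before the 14th (which is what the hand-picked dates amount to), they can be derived in base R. `previous_sunday` is a hypothetical helper, not part of prophet.

```r
# Hypothetical helper: the Sunday strictly before each date.
# A date that is itself a Sunday is moved back a full week.
previous_sunday <- function(d) {
  wday <- as.integer(format(d, "%w"))   # 0 = Sunday ... 6 = Saturday
  d - ifelse(wday == 0L, 7L, wday)
}

valentines <- as.Date(paste0(2015:2020, "-02-14"))
aligned <- previous_sunday(valentines)
print(format(aligned))
```

The resulting vector could then be used as the `ds` column for `valentines_day` in the holidays data frame.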

Upvotes: 2
