Reputation: 55
I have a daily time series of a product's sales. The series runs from 01/01/2016 to 31/08/2017.
Because it is a six-day week (my week starts on Monday and ends on Saturday), there is no data for Sundays. I understand that before fitting an ARIMA model I first need to fill in the missing values. This is where I need help: I've read that I can fill the missing values with na.approx or NA, but I do not know how to do that.
You can see my series here:
https://drive.google.com/file/d/0BzIf8XvzKOGWSm1ucUdYUVhfVGs/view?usp=sharing
As you can see, there is no data for Sundays. I need to know how to fill the missing values so that I can fit an ARIMA model and forecast the rest of 2017.
Upvotes: 3
Views: 7972
Reputation: 7730
In principle you could combine imputeTS (for filling the NAs) with forecast (for doing the forecast).
It can be done quite easily:
library("imputeTS")
library("forecast")
ts_sunday %>% na_kalman() %>% auto.arima() %>% forecast(h=10)
This would do the job, but in this specific case it would be a bad idea. If the data were missing at random, you could consider this solution. But they are not: it is always Sundays that are missing. Some time series models can also deal with NAs and still fit a model, but the drawbacks are nearly the same as with the previous solution: how should a model treat Sundays when they are never observed? Probably the best solution (from a statistics perspective) is what avid_useR described in another answer: removing Sundays completely. If you don't need Sundays, and have no values for them anyway, then just remove them. But sooner or later this usually leads to the next question, 'how to treat public holidays', which are also often NA. Also always keep your problem in mind: a solution that fits one setting might not make sense in another.
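As a rough illustration of "just remove Sundays", here is a minimal base-R sketch that treats the data as a regular series with six observations per week. The data are simulated, the names (dates, units, sales_ts) are hypothetical, and the ARIMA orders are placeholders rather than a tuned model.

```r
# Simulated stand-in for the sales data (hypothetical values)
set.seed(1)
dates <- seq(as.Date("2016-01-01"), as.Date("2017-08-31"), by = "day")

# as.POSIXlt()$wday is 0 for Sunday regardless of locale; drop those days
dates <- dates[as.POSIXlt(dates)$wday != 0]

# Made-up weekday pattern (Monday..Saturday) plus autocorrelated noise
weekday_effect <- c(-5, 0, 0, 0, 5, 10)[as.POSIXlt(dates)$wday]
noise <- as.numeric(arima.sim(list(ar = 0.5), n = length(dates), sd = 3))
units <- 100 + weekday_effect + noise

# Six observations per week: Sundays simply do not exist in this series
sales_ts <- ts(units, frequency = 6)

# Placeholder orders: AR(1) with a seasonal AR(1) at the six-day period
fit <- arima(sales_ts, order = c(1, 0, 0),
             seasonal = list(order = c(1, 0, 0), period = 6),
             method = "ML")

# Forecast two six-day weeks ahead
fc <- predict(fit, n.ahead = 12)
fc$pred
```

Note that this only stays aligned with the calendar if every week really has six observations; days that are missing for other reasons, such as public holidays, would shift the weekly cycle.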
Upvotes: 0
Reputation: 18661
Here are three ways of doing it:
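library(lubridate)  # mdy(), wday()
library(xts)        # xts(); also attaches zoo, which provides na.approx()
library(dplyr)      # filter(), mutate(), %>%
library(forecast)   # auto.arima(), forecast()
# Parse the Date column (mdy() expects month/day/year)
df$Date = mdy(df$Date)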
Removing Sundays:
# wday() returns 1 for Sunday with lubridate's default week start
ts_no_sunday = df %>%
  filter(wday(Date) != 1) %>%
  {xts(.$Units, .$Date)}
plot(ts_no_sunday)
no_sunday_arima = auto.arima(ts_no_sunday)
plot(forecast(no_sunday_arima, h = 10))
Replace Sundays with NAs:
# Keep the Sunday rows but set their values to NA
ts_sunday = df %>%
  mutate(Units = replace(Units, wday(Date) == 1, NA)) %>%
  {xts(.$Units, .$Date)}
plot(ts_sunday)
sunday_arima = auto.arima(ts_sunday)
plot(forecast(sunday_arima, h = 10))
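The replacement above only has an effect if df already contains a row for each Sunday. If the Sunday rows are absent altogether (an assumption about the data, which is not shown here), one way to create them is to join against a complete daily calendar first. A minimal base-R sketch with a small made-up df:

```r
# Hypothetical excerpt: Friday, Saturday, Monday, Tuesday (no Sunday row)
df <- data.frame(
  Date  = as.Date(c("2017-08-25", "2017-08-26", "2017-08-28", "2017-08-29")),
  Units = c(10, 12, 18, 17)
)

# Complete daily calendar spanning the observed range
full <- data.frame(Date = seq(min(df$Date), max(df$Date), by = "day"))

# Left join: dates without a matching row get Units = NA
df_full <- merge(full, df, by = "Date", all.x = TRUE)
df_full
```

The resulting df_full has one row per calendar day, with NA in Units for 2017-08-27 (a Sunday), and can then be passed to xts() as before.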
Interpolate Sundays:
# na.rm = FALSE keeps leading/trailing NAs so the column length is unchanged
ts_interp = df %>%
  mutate(Units = replace(Units, wday(Date) == 1, NA),
         Units = na.approx(Units, na.rm = FALSE)) %>%
  {xts(.$Units, .$Date)}
plot(ts_interp)
interp_arima = auto.arima(ts_interp)
plot(forecast(interp_arima, h = 10))
Notes:
As one can see, they produce different forecasts. This is because the first time series is irregular, the second is a regular time series with missing values, and the third is a regular time series with interpolated data. In my opinion, a better way to deal with missing values is to interpolate before fitting an ARIMA model, since ARIMA assumes that the time series is regularly spaced. This, however, also depends on whether your "missing" data points are genuinely missing or reflect a stop in activity. Genuinely missing values should be interpolated, whereas for a stop in activity you are probably better off removing Sundays and treating the time series as if Sundays don't exist.
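To make the interpolation option concrete: na.approx fills gaps by linear interpolation by default. The same idea can be sketched in base R with approx(); the numbers below are made up for illustration.

```r
# Hypothetical stretch of sales with one missing Sunday (position 4)
units <- c(10, 11, 12, NA, 18, 17)

idx   <- seq_along(units)
known <- !is.na(units)

# Linearly interpolate the missing positions from their observed neighbours
filled <- units
filled[!known] <- approx(idx[known], units[known], xout = idx[!known])$y
filled
```

Here the missing Sunday becomes the midpoint of Saturday (12) and Monday (18), i.e. 15. Whether such a value is meaningful depends on the distinction made above: it is a plausible estimate for a genuinely missing observation, but not for a day on which no sales could happen.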
See this discussion on "How to handle nonexistent or missing data?" and this one on "Using the R forecast package with missing values and/or irregular time series".
Upvotes: 6