girijesh96

Reputation: 455

Issue in finding the ACF value for a time-series problem

I'm trying to find the ACF values for a time-series problem. I have a dataset from 2003 to 2017.

I create the time series from the data using the following function:

tf = ts(df$x, start = c(2003,1), end = c(2017,12), frequency = 12)

When I try to plot the ACF using

acf(tf)

my graph looks like this:

[ACF plot]

I am not able to conclude what the value of 'p' should be when fitting the model, as the graph shows no clear cutoff:

fit = arima(tf, c(p,2,3))

For reference, my data (https://github.com/girijesh18/dataset/blob/master/timeSeries.csv) looks like this.

I am unable to figure out what value I should use for 'p'. I have also tried different values of 'p' in the range 1 to 20, but the predicted values are not very accurate. Any help would be appreciated.

Upvotes: 3

Views: 1687

Answers (2)

AkselA

Reputation: 8846

First off, I'd take the difference of your time series to detrend it, as it has a considerable stochastic trend. The most obvious sign of this, besides the steady upward rise of the series itself, is that the ACF components take a long time to die out. While you can fit a model to the data as-is with a drift term, it's much easier to read the ACF and PACF plots when the trend isn't there.

tf <- structure(c(58082, 48500, 45723, 53662, 46723, 45070, 49782, 55437,
57672, 61121, 43857, 49819, 50750, 53589, 53812, 53575, 52339, 51115,
56529, 61498, 58757, 72876, 55999, 58374, 63885, 63287, 60027, 65795,
62850, 61908, 68108, 72639, 77105, 84996, 65488, 62178, 74750, 66085,
59711, 69304, 68357, 67133, 74545, 73623, 82071, 89533, 72117, 69004,
72681, 80214, 80744, 81643, 87599, 86213, 97495, 97841, 104953, 110353,
90415, 83875, 93160, 89539, 85021, 91314, 87036, 83731, 91047, 94552,
105628, 94743, 84954, 72535, 77898, 68418, 60609, 73703, 67298, 64375,
73550, 76887, 77538, 92233, 73267, 77779, 80634, 72736, 81475, 87595,
87386, 88874, 95145, 96991, 95186, 106122, 81173, 77941, 88576, 86372,
77850, 91188, 90547, 87803, 95264, 90054, 100544, 96302, 82402, 78297,
91847, 86235, 87557, 91139, 93116, 93855, 94172, 100003, 97051, 86785,
84849, 81682, 88273, 85645, 80121, 92187, 96409, 97609, 94971, 111356,
102049, 110838, 97596, 88747, 100882, 97801, 99312, 100163, 112241, 101667,
122227, 127548, 123216, 131987, 112248, 118140, 128127, 114529, 151671,
135476, 148513, 141155, 142314, 142144, 139774, 142715, 124773, 111401,
129554, 140624, 128378, 130208, 141051, 132299, 145779, 152341, 146552,
150930, 139732, 133423, 154363, 148374, 137392, 149258, 160086, 154738,
159570, 164496, 166885, 188369, 144181, 148121, 169758, 158890, 159699,
161691, 165828, 175617, 181875, 182883), .Tsp = c(1, 188, 1), class = "ts")

par(mfrow=c(3, 1), mar=c(3, 3, 2, 1), mgp=c(2, 0.6, 0), oma=c(0, 0, 0, 0))
plot(tf); acf(tf, main=""); pacf(tf, main="")

Fig.1


tf.d <- diff(tf)
plot(tf.d); acf(tf.d, main=""); pacf(tf.d, main="")

Fig.2


Now we can start interpreting the plots, beginning with the ACF. We can see that the correlation at lag 1 is negative. This can often happen when we take the difference of a series and is termed 'overdifferencing': from having data points that were excessively similar to past data points, they are now excessively dissimilar, which increases fluctuations of period 1. From the plot we can see that the level is roughly -0.3, which, while probably not large enough to cause issues, is worth keeping in mind. Conventional wisdom also says that an ACF that dies off like this after 1 lag should be fitted with an MA(1) coefficient; my experience is that it's better to start with the AR coefficients, as they may make the MA coefficients redundant. But the opposite can also be the case, so an MA(1) should be among the candidate models.
Looking at the PACF plot, we see two significant components at lags 1 and 2, which indicates an AR(2) model.
Looking further at the ACF and PACF plots, we see hints of a wave-like feature and a positive peak at lag 12, which tells me that this is either monthly or bi-hourly data and that we have a seasonal component. Figuring out the seasonal components isn't very different from figuring out the non-seasonal ones; we just describe correlations in terms of seasonal/periodic lags instead of sample lags.

tf.d12 <- ts(tf.d, f=12)
plot(tf.d12); acf(tf.d12, main="", lag.max=12*4); pacf(tf.d12, main="", lag.max=12*4)

Fig.3


Looking at the ACF plot, we see strong components at lags of 1, 2, 3, 4… periods, somewhat reminiscent of the first ACF plot, meaning we have a first-order seasonal difference component. But instead of transforming the data again, we'll set D to 1 in what has now become a SARIMA model. As mentioned at the beginning, non-stationarity will obscure the signs of an MA process in the ACF plot, so we'll have to wait and see if any is needed.
In the PACF plot we can see significant components at 1, 2 and maybe 3 period lags, but in the spirit of parsimony we'll assume a SAR(2) will suffice.

The next step is to fit all the candidate models and evaluate the ACF and PACF of their residuals. If we've been smart in selecting candidates, there shouldn't be too many. Not knowing anything about the nature of the data, these three were chosen more or less arbitrarily.

ari1 <- arima(tf.d12, order=c(2, 0, 0), seasonal=c(2, 1, 0))
ari2 <- arima(tf.d12, order=c(1, 0, 1), seasonal=c(2, 1, 0))
ari3 <- arima(tf.d12, order=c(1, 0, 1), seasonal=c(2, 1, 1))

par(mfcol=c(3, 2), mar=c(3, 3, 2, 1), mgp=c(2, 0.6, 0), oma=c(0, 0, 1.5, 0))
acf(residuals(ari1), main="", ylim=c(-0.2, 1), lag.max=12*4)
mtext(ari1$call, 3, cex=0.8)
acf(residuals(ari2), main="", ylim=c(-0.2, 1), lag.max=12*4)
mtext(ari2$call, 3, cex=0.8)
acf(residuals(ari3), main="", ylim=c(-0.2, 1), lag.max=12*4)
mtext(ari3$call, 3, cex=0.8)
pacf(residuals(ari1), main="", ylim=c(-0.2, 0.44), lag.max=12*4)
mtext(ari1$call, 3, cex=0.8)
pacf(residuals(ari2), main="", ylim=c(-0.2, 0.44), lag.max=12*4)
mtext(ari2$call, 3, cex=0.8)
pacf(residuals(ari3), main="", ylim=c(-0.2, 0.44), lag.max=12*4)
mtext(ari3$call, 3, cex=0.8)
mtext("residual values", 3, outer=TRUE, cex=1.3)

Fig.4


Plots like these, together with empirical and theoretical knowledge about the data, information criteria (like AICc) and validation (like CV), will lead you to the appropriate model. Blindly trusting auto.arima() is no good.
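To make the information-criterion step concrete, here is a minimal, self-contained sketch on a simulated AR(2) series (standing in for the differenced data; `stats::arima` and `AIC` are base R, while the small-sample AICc is computed by hand since base R doesn't provide it):

```r
# Simulated AR(2) series standing in for the differenced data
set.seed(42)
x <- arima.sim(model = list(ar = c(0.5, 0.3)), n = 200)

# Fit candidate orders and compare by AIC (lower is better)
fits <- list(ar1 = arima(x, order = c(1, 0, 0)),
             ar2 = arima(x, order = c(2, 0, 0)),
             ma1 = arima(x, order = c(0, 0, 1)))
sapply(fits, AIC)

# Small-sample corrected AICc, computed by hand
aicc <- function(fit) {
    k <- length(coef(fit)) + 1    # parameters incl. sigma^2
    n <- fit$nobs
    AIC(fit) + 2 * k * (k + 1) / (n - k - 1)
}
sapply(fits, aicc)
```

The same `sapply(…, AIC)` pattern works directly on the `ari1`/`ari2`/`ari3` fits above; with n = 188 the AICc correction is small but not negligible.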

Some further notes:

par(mfrow=c(1, 1), mar=c(3, 3, 2, 1), mgp=c(2, 0.6, 0), oma=c(0, 0, 0, 0))
plot(stl(tf.d12, "periodic"))

Fig.5


If we decompose tf.d12 we'll see that there is a slight trend remaining in the data. Adding a non-seasonal difference to the model might be appropriate:

arima(tf.d12, order=c(2, 1, 1), seasonal=c(2, 1, 1))

The decomposition also reveals what looks like a temporal component to the remainder, the magnitudes appear to increase with time. Our model does not deal with this.
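A common remedy for variance that grows with the level (an option worth checking, not something the decomposition proves is needed) is a variance-stabilizing transform such as taking logs before differencing. A self-contained sketch on simulated data with level-proportional noise:

```r
# Simulated series whose noise scales with its level, mimicking
# fluctuation magnitudes that increase over time
set.seed(1)
n <- 188
level <- exp(seq(log(50000), log(180000), length.out = n))
x <- ts(level * (1 + rnorm(n, sd = 0.08)))

x.d     <- diff(x)       # spread of differences grows over time
x.log.d <- diff(log(x))  # roughly constant spread (Box-Cox lambda = 0)

par(mfrow = c(2, 1), mar = c(3, 3, 2, 1))
plot(x.d); plot(x.log.d)
```

If a plain log over- or under-corrects, forecast::BoxCox.lambda() can suggest an intermediate lambda.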

tsoutliers::locate.outliers indicates one additive outlier at index 145 and a couple of temporary changes at indices 71 and 149.
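For reproducing that check, the package's high-level entry point tso() is the easier route, since it fits a model and searches for outliers in one step (a sketch; requires the tsoutliers package and the tf object from above):

```r
# install.packages("tsoutliers")  # if not already installed
library(tsoutliers)

# Searches for additive outliers (AO), level shifts (LS) and
# temporary changes (TC) while fitting an ARIMA model
out <- tso(tf, types = c("AO", "LS", "TC"))
print(out)  # outlier type, index and estimated effect
plot(out)   # original vs. outlier-adjusted series
```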

Sorry that this got a bit long-winded; I started studying the data and couldn't stop. In the end this whole thing might be more suitable for Cross Validated, where there are also a lot of people more knowledgeable than me who could give a second opinion.

Upvotes: 1

Alex Kinman

Reputation: 2605

For figuring out p, you use the PACF, not the ACF.

It is much easier, though, to just use the auto.arima function from the forecast package in R, which will automatically find the best p, d, q values for you.
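A minimal sketch (assuming the forecast package is installed and df is loaded as in the question; auto.arima searches over (p,d,q)(P,D,Q)[12] orders by AICc):

```r
library(forecast)

# tf as constructed in the question
tf <- ts(df$x, start = c(2003, 1), end = c(2017, 12), frequency = 12)

fit <- auto.arima(tf)        # automatic order selection by AICc
summary(fit)
checkresiduals(fit)          # residual ACF plus Ljung-Box test
plot(forecast(fit, h = 12))  # forecast one year ahead
```

Even then, checking the residual diagnostics (as the other answer does by hand) is worthwhile before trusting the selected model.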

Upvotes: 0
