Reputation: 874
I have a dataset covering thousands of individuals, with a parameter X measured yearly over the last 9 years.
Basically, the data are in a data frame df:
id,year,x,feature
A,2016,376,female
A,2015,391,female
A,2014,376,female
A,2013,373,female
A,2012,347,female
A,2011,330,female
B,2016,398,male
B,2015,391,male
B,2014,410,male
B,2013,393,male
B,2012,408,male
B,2011,288,male
C,2016,2464,male
C,2015,2465,male
C,2014,2500,male
C,2013,2215,male
C,2012,2228,male
C,2011,1839,male
etc.
I want to estimate different models on these time series,
like predict(x(t)) = f(x(t-1), x(t-2), ..., x(t-n), feature, id (taken as a random factor)).
I can see how to use ts for autoregressive modelling, but that would fit thousands of individual models, and I want a global prediction (with its inherent problems) based on the time history and the features.
lm is not a good idea since the data are highly autocorrelated. Any good ideas?
Upvotes: 4
Views: 6171
Reputation: 11
The form of the function f() leaves many choices open.
Within the linear class, however, you can use vector generalized linear models (via vglm()) to fit generalized linear models with ARMA (or GARCH) structures, incorporating covariates.
For example, assuming the random errors are normally distributed, you can use the family function ARff() from the package VGAMextra, as in the first fit below.
The second fit below uses the non-parametric counterpart, i.e., VGAMs, via smart prediction. The only drawback is that VGLMs/VGAMs do not handle random effects.
library(VGAM)
library(VGAMextra)
# Fit a linear model to the mean of the normal distribution,
# allowing an AR(3) structure. Use the modelling function vglm()
# with the family function ARff().
df.read <- DF   # DF as given by G.G.
fit.Lines <- vglm(x ~ feature, ARff(order = 3,
                                    zero = c("Var", "ARcoeff")),
                  data = df.read, trace = TRUE)
coef(fit.Lines, matrix = TRUE)
summary(fit.Lines, HD = FALSE)
with(df.read, plot(fitted.values(fit.Lines) ~ year,
                   ylim = c(0, 3000),
                   pch = 19, col = as.factor(feature)))
# Using VGAMs; here the family function uninormal() is used.
df.read2 <- data.frame(embed(df.read$x, 4))
names(df.read2) <- c("x", "xLag1", "xLag2", "xLag3")
df.read2 <- transform(df.read2, year = df.read$year[-c(1:3)],
                      feature = df.read$feature[-c(1:3)])
fit.Lines.vgams <- vgam(x ~ sm.bs(xLag1) + sm.bs(xLag2) +
                          sm.bs(xLag3) + feature + year,
                        uninormal, data = df.read2, trace = TRUE)
with(df.read2, plot(fitted.values(fit.Lines.vgams) ~ year,
                    ylim = c(0, 3000),
                    pch = 19, col = as.factor(feature)))
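One point to keep in mind with the lagged columns: embed() works on the stacked x column as a whole, so it forms lags across the boundary between one id and the next, and with the rows sorted by descending year the "previous" row is actually the following year. A minimal base-R sketch of building lags within each id is given here; the helper lag_k() and the toy data frame are illustrative assumptions, not part of VGAM or of the original data handling.

```r
# Toy data in the same layout as the question (ids A and B only).
df <- data.frame(
  id   = rep(c("A", "B"), each = 6),
  year = rep(2016:2011, times = 2),
  x    = c(376, 391, 376, 373, 347, 330,
           398, 391, 410, 393, 408, 288)
)

# Sort by id and ascending year so that "previous row" means "previous year".
df <- df[order(df$id, df$year), ]

# Hypothetical helper: shift a vector down by k positions, padding with NA.
lag_k <- function(v, k) c(rep(NA, k), head(v, -k))

# ave() applies the shift separately within each id,
# so no lag ever reaches into another individual's series.
df$xLag1 <- ave(df$x, df$id, FUN = function(v) lag_k(v, 1))
df$xLag2 <- ave(df$x, df$id, FUN = function(v) lag_k(v, 2))

# Keep only rows that have a full lag history.
lagged <- df[complete.cases(df), ]
lagged
```

A data frame built this way could presumably be passed to vgam() in place of the embed() output, at the cost of losing the first few years of each individual.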
Upvotes: 1
Reputation: 270010
There are many possible models but here is a mixed effects model with AR1 structure that you can try.
library(nlme)
fm <- lme(x ~ year + feature, random = ~ year | id, DF,
          correlation = corAR1(form = ~ year | id))
summary(fm)
and here is a plot of the data:
library(ggplot2)
ggplot(DF, aes(year, x, group = id, col = feature)) + geom_line() + geom_point()
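For intuition about the corAR1 term in the model above: it assumes that, within each id, residuals k years apart have correlation Phi^k, where Phi is the single parameter that lme() estimates. The sketch below just writes out that implied correlation matrix in base R; the value 0.6 is an arbitrary illustration, not an estimate from these data.

```r
# phi is an illustrative value, not an estimate from the data.
phi <- 0.6
years <- 2011:2016

# Implied within-id correlation matrix:
# entry (i, j) is phi^|year_i - year_j|.
R <- phi^abs(outer(years, years, "-"))
dimnames(R) <- list(years, years)
round(R, 3)
```

Observations from different ids are treated as uncorrelated under this structure, which is why the grouping `| id` appears in the corAR1 formula.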
Note: We have assumed this input data:
Lines <- "
id,year,x,feature
A,2016,376,female
A,2015,391,female
A,2014,376,female
A,2013,373,female
A,2012,347,female
A,2011,330,female
B,2016,398,male
B,2015,391,male
B,2014,410,male
B,2013,393,male
B,2012,408,male
B,2011,288,male
C,2016,2464,male
C,2015,2465,male
C,2014,2500,male
C,2013,2215,male
C,2012,2228,male
C,2011,1839,male"
DF <- read.csv(text = Lines, strip.white = TRUE)
Upvotes: 4