Reputation: 261
i have many time series, each one representing a plant species. I think there is a pattern dependent on the woody density. High woody density species just flower between rain periods. Low woody density species flower in any periods.
With many species time series and measures of woody density, how do I model this with R to demonstrate this pattern?
Here is an example of what the data looks like:
#Woody Density
set.seed(69)
wden<-round(c(rnorm(5,mean=50),rnorm(5,mean=90)))
names(wden)<-c(paste("sp",1:10,sep=""))
wden
#Chuva
rain<-c(150,100,50,40,20,20,30,50,70,100,150,200,150,100,50,30,20,20,40,50,70,100,150,200)
#Flowering measures
ydet<-c(10,10,10,10,20,40,50,40,20,10,10,10)
#2 years for 5 low woody density and 5 high density species
flowering<-matrix(NA,nrow=24, ncol=10,dimnames=list(paste("month",1:24,sep=""),paste("sp",1:10,sep="")))
for (i in 1:5) {
flowering[,i]<-round(c(ydet+rnorm(12,mean=5,sd=5),ydet+rnorm(12,mean=5,sd=5)),digits=2)
}
for (i in 6:10) {
flowering[,i]<-round(c(rnorm(12,mean=30,sd=5),rnorm(12,mean=30,sd=5)),digits=2)
}
#Changing objects to Time series
flowering<-ts(flowering)
#Plot series
plot(flowering)
#Making colors for wood density
cores<-heat.colors(10,alpha=1)
matplot(c(1:24),flowering,type="l",lwd=2,lty=1,xlab="Time",ylab="Flowering",col=cores[order(wden)])
#Plotting Rain Together with time series
bargraph<-barplot(rain/max(rain)*100,xlab="Time",ylab="Rain")
matlines(bargraph,flowering,type="l",lwd=2,lty=1,xlab="Time",ylab="Flowering",col=cores[order(wden)])
axis(1,at=bargraph,labels=1:24)
axis(4,at=seq(0,100,by=10))
Upvotes: 3
Views: 1899
Reputation: 145
Obviously this is well overdue but in the last several years there have been some pretty great advancements in working with autocorrelated data. I'm doing the same at present, only binomial AND with missclassificaiton error (go big or go home, right?). Anyway, I've used Scott-Hawyard's code from the MRSea Package and Pirotta et al. 2011 to determine the significant predictors in autocorrelated temporal data. Below is how I approach it. I've written a custom model fitting function which is slow (to help avoid abuse and misuse) to identify meaningful predictors (via backwards QIC selection) as well as one that does the repeated Walds tests to determine significance. In this case, the forest type was, indeed significant and species was not.
For the blocking structure I've used species, it's probably more conservative than you need (forest type might suffice) but I tend to be conservative in my choices. Hope this helps and please, all cite Pirotta et al 2011 and the MRSea package (available on the CREEM website) if you use this code.
Also note that the anova function may not be appropriate. Pirotta used the anova.gee function from the geepack but that function seems to have disappeared from the newer versions and I have yet to find a substitute. Suggestions welcome.
### Functions
# This code calculates do backwards stepwise QIC to select the best model,
# then repeated Walds tests to retain meaningful variables and CalcAUC to caluclate
# the area under the ROC curve for each fitted model. All code based on
# Pirotta E, Matthiopoulos J, MacKenzie M, Scott-Hayward L, Rendell L Modelling sperm whale habitat preference: a novel approach combining transect and follow data
library(geepack)
library(splines)
library(AUC)
library(PresenceAbsence)
library(ROCR) # to build the ROC curve
library(mvtnorm) # for rmvnorm used in predictions/plotting
SelectModel=function(ModelFull){
# Calculate the QIC of the full model
fullmodQ=QIC(ModelFull)
newQIC=0
terms=attr(ModelFull$terms,"term.labels")
while(newQIC != fullmodQ & length(terms)>1){
# get all the terms for the full model
terms <- attr(ModelFull$terms,"term.labels")
n=length(terms)
newmodel=list()
newQIC=list()
newmodel[[1]]=ModelFull
newQIC[[1]]=fullmodQ
# Make n models with selection
for (ii in 1:n){
dropvar=terms[ii]
newTerms <- terms[-match(dropvar,terms)]
newform <- as.formula(paste(".~.-",dropvar))
newmodel[[ii+1]] <- update(ModelFull,newform)
newQIC[[ii+1]] =QIC(newmodel[[ii]])
}
# Get the model with the lowest QIC
LowestMod=which.min(unlist(newQIC))
if (LowestMod != 1){
ModelFull=newmodel[[LowestMod]]
newQIC=min(unlist(newQIC))
} else {
ModelFull=ModelFull
newQIC=min(unlist(newQIC))
}
#end the model selection
}
return(ModelFull)
}
DropVarsWalds=function(ModelFull){
# If no terms included return
if (length(attr(ModelFull$terms,"term.labels"))<2){
NewModel='No Covariates to select from'
}else{
OldModel=ModelFull
# Get the anova values
temp=anova(ModelFull)
# Make n models with selection
while(length(which(temp$`P(>|Chi|)`>.05))>0 & is.data.frame(temp)){
# get the maximum value
dropvar=rownames(temp)[which.max(temp$`P(>|Chi|)`)]
# new formula for the full model
newform <- as.formula(paste(".~.-",dropvar))
# new full model
ModelFull= update(ModelFull,newform)
# Get the model covariate names
terms <- attr(ModelFull$terms,"term.labels")
# # Get the anova values
# temp=anova(ModelFull)
temp=tryCatch({anova(ModelFull)}, error=function(e){e})
}
NewModel=ModelFull
}
return(NewModel)
}
# functions to produce partial plots (largely based on MRSea, Scott-Hawyard)
partialDF=function(mod, data, Variable){
coefpos <- c(1, grep(Variable, colnames(model.matrix(mod))))
xvals <- data[, which(names(data) == Variable)]
newX <- seq(min(xvals), max(xvals), length = 500)
eval(parse(text = paste(Variable, "<- newX",
sep = "")))
response <- rep(1, 500)
newBasis <- eval(parse(text = labels(terms(mod))[grep(Variable,
labels(terms(mod)))]))
partialfit <- cbind(rep(1, 500), newBasis) %*% coef(mod)[coefpos]
rcoefs <- NULL
try(rcoefs <- rmvnorm(1000, coef(mod), summary(mod)$cov.scaled),
silent = T)
if (is.null(rcoefs) || length(which(is.na(rcoefs) ==
T)) > 0) {
rcoefs <- rmvnorm(1000, coef(mod), as.matrix(nearPD(summary(mod)$cov.scaled)$mat))
}
rpreds <- cbind(rep(1, 500), newBasis) %*% t(rcoefs[,
coefpos])
quant.func <- function(x) {
quantile(x, probs = c(0.025, 0.975))
}
cis <- t(apply(rpreds, 1, quant.func))
partialfit <- mod$family$linkinv(partialfit)
cis <- mod$family$linkinv(cis)
fitdf=data.frame(x=newX, y=partialfit, LCI=cis[,1], UCI=cis[,2])
colnames(fitdf)[1]=Variable
return(fitdf)
}
partialdf_factor=function(mod, data, variable){
coeffac <- c(1,grep(variable, colnames(model.matrix(mod))))
coefradial <- c(grep("LocalRadialFunction", colnames(model.matrix(mod))))
coefpos <- coeffac[which(is.na(match(coeffac, coefradial)))]
xvals <- data[, which(names(data) == variable)]
newX <- sort(unique(xvals))
newX <- newX[2:length(newX)]
partialfit <- coef(mod)[c(coefpos)]
rcoefs <- NULL
try(rcoefs <- rmvnorm(1000, coef(mod), summary(mod)$cov.scaled),
silent = T)
if (is.null(rcoefs) || length(which(is.na(rcoefs) == T)) > 0) {
rcoefs <- rmvnorm(1000, coef(mod), as.matrix(nearPD(summary(mod)$cov.scaled)$mat))
}
if((length(coefpos))>1){
rpreds <- as.data.frame(rcoefs[, c(coefpos)])
fitdf_year=data.frame(vals=rpreds[,1])
fitdf_year$FactorVariable=as.factor(paste(variable, levels(xvals)[1], sep = ''))
############################
# Recompile for plotting #
#############################
for(jj in 2:ncol(rpreds)){
temp=data.frame(vals=rpreds[,jj])
temp$FactorVariable=as.factor(colnames(rpreds)[jj])
fitdf_year=rbind(fitdf_year, temp)
rm(temp)
}
}else{
rpreds <- rcoefs[,coefpos]
fitdf_year=data.frame(vals=rpreds)
fitdf_year$FactorVariable=colnames(model.matrix(mod))[coefpos[2:length(coefpos)]]
}
fitdf=fitdf_year
colnames(fitdf)[2]=variable
return(fitdf)
}
#Reshape your data
temp=data.frame(flowering)
flowering=data.frame(y=temp[,1], spp=colnames(temp)[1])
for(ii in 2:ncol(temp)){
flowering=rbind(flowering, data.frame(y=temp[,ii], spp=colnames(temp)[ii]))
}
flowering$rain=rep(rain, ncol(temp))
flowering$woods=as.factor(c(rep('dense',nrow(temp)*5), rep('light',nrow(temp)*5)))
fulmod=geeglm(formula=y~bs(rain)+woods,
family = gaussian,
id=spp,
data=flowering,
corstr = 'ar1')
# Use stepwise QIC to select the best model
QICmod=SelectModel(fulmod)
# Use repeated walds to select signficinat variables
sig_mod=DropVarsWalds(ModelFull = QICmod)
Upvotes: 0
Reputation: 227061
I might actually suggest you try this on http://stats.stackexchange.com, or on the [email protected]
mailing list. It's a little bit of a can of worms. The fundamental problem is that it's hard to prove that the association of two time series is causal, since (especially when both fluctuate regularly over time) it's easy for them to simply be fluctuating at about the same period and hence to appear to line up (the classic examples of this are the sunspot cycle and various totally unrelated time series like the New York stock exchange).
The classical approach to this would be to "whiten" each of the time series (you could fit a periodic spline or a sinusoidal model or just take differences of the pattern from the seasonal average) independently until each one is indistinguishable from white noise, then examine the cross-correlations among the time series (at lag zero, i.e. regular correlations, or at other lags to represent a leading/following pattern). In your case you would then presumably want to see how the cross-correlations varied in turn with wood density.
Alternatively, you could just lump the data into "rainy season" and "dry season" and do a more standard analysis (you get rid of most of the correlation by lumping), but (1) it would be nice to have an a priori division of seasons rather than doing it by looking at the data; (2) you might lose some power and/or interesting fine-scale patterns this way; (3) some of the basic inferential problems are still there -- is flowering associated with rain, or just with the rainy season?
Nice example, though.
Upvotes: 2