Augusto Ribas

Reputation: 261

How to test differences among several time series using R

I have many time series, each one representing a plant species, and I suspect there is a pattern that depends on wood density: high wood density species flower only between rain periods, while low wood density species flower in any period.

Given the time series for many species and a measure of wood density for each, how do I model this in R to demonstrate the pattern?

Here is an example of what the data looks like:

#Woody Density
set.seed(69)
wden<-round(c(rnorm(5,mean=50),rnorm(5,mean=90)))
names(wden)<-c(paste("sp",1:10,sep=""))
wden

#Rain
rain<-c(150,100,50,40,20,20,30,50,70,100,150,200,150,100,50,30,20,20,40,50,70,100,150,200)


#Flowering measures
ydet<-c(10,10,10,10,20,40,50,40,20,10,10,10)

#2 years for 5 low woody density and 5 high density species
flowering<-matrix(NA,nrow=24, ncol=10,dimnames=list(paste("month",1:24,sep=""),paste("sp",1:10,sep="")))
for (i in 1:5) {
               flowering[,i]<-round(c(ydet+rnorm(12,mean=5,sd=5),ydet+rnorm(12,mean=5,sd=5)),digits=2)
               }
for (i in 6:10) {
               flowering[,i]<-round(c(rnorm(12,mean=30,sd=5),rnorm(12,mean=30,sd=5)),digits=2)
               }
#Changing objects to Time series
flowering<-ts(flowering)
#Plot series
plot(flowering)

#Making colors for wood density (use rank(), not order(), so each species
#gets the color at its own density rank)
cores<-heat.colors(10,alpha=1)
matplot(1:24,flowering,type="l",lwd=2,lty=1,xlab="Time",ylab="Flowering",col=cores[rank(wden)])

#Plotting rain together with the time series
bargraph<-barplot(rain/max(rain)*100,xlab="Time",ylab="Rain")
matlines(bargraph,flowering,type="l",lwd=2,lty=1,col=cores[rank(wden)])
axis(1,at=bargraph,labels=1:24)
axis(4,at=seq(0,100,by=10))

Upvotes: 3

Views: 1899

Answers (2)

K.J. Palmer

Reputation: 145

Obviously this is well overdue, but in the last several years there have been some great advances in working with autocorrelated data. I'm doing the same at present, only binomial AND with misclassification error (go big or go home, right?). Anyway, I've used Scott-Hayward's code from the MRSea package and Pirotta et al. 2011 to determine the significant predictors in autocorrelated temporal data. Below is how I approach it. I've written a custom model-fitting function, which is slow (to help avoid abuse and misuse), that identifies meaningful predictors via backwards QIC selection, as well as one that runs repeated Wald tests to determine significance. In this case forest type was indeed significant and species was not.

For the blocking structure I've used species; it's probably more conservative than you need (forest type might suffice), but I tend to be conservative in my choices. Hope this helps, and please cite Pirotta et al. 2011 and the MRSea package (available on the CREEM website) if you use this code.

Also note that the anova function may not be appropriate. Pirotta used the anova.gee function from the geepack package, but that function seems to have disappeared from newer versions and I have yet to find a substitute. Suggestions welcome.

### Functions  

# This code performs backwards stepwise QIC selection to choose the best model,
# then repeated Wald tests to retain meaningful variables, and CalcAUC to calculate
# the area under the ROC curve for each fitted model. All code based on
# Pirotta E, Matthiopoulos J, MacKenzie M, Scott-Hayward L, Rendell L (2011)
# Modelling sperm whale habitat preference: a novel approach combining transect and follow data
library(geepack)
library(splines)
library(AUC)
library(PresenceAbsence)
library(ROCR)            # to build the ROC curve
library(mvtnorm)         # for rmvnorm used in predictions/plotting
library(Matrix)          # for nearPD, used when the coefficient covariance is not positive definite


SelectModel=function(ModelFull){

  # Calculate the QIC of the full model
  fullmodQ=QIC(ModelFull)
  newQIC=0
  terms=attr(ModelFull$terms,"term.labels")


  while(newQIC != fullmodQ & length(terms)>1){

    # QIC of the current model (recomputed each pass so the while() test
    # compares against the model selected in the previous pass)
    fullmodQ=QIC(ModelFull)

    # get all the terms for the current model
    terms <- attr(ModelFull$terms,"term.labels")
    n=length(terms)

    newmodel=list()
    newQIC=list()

    newmodel[[1]]=ModelFull
    newQIC[[1]]=fullmodQ

    # Make n models with selection
    for (ii in 1:n){
      dropvar=terms[ii]
      newTerms <- terms[-match(dropvar,terms)]
      newform <- as.formula(paste(".~.-",dropvar))
      newmodel[[ii+1]] <- update(ModelFull,newform)
      newQIC[[ii+1]] =QIC(newmodel[[ii+1]])

    }

    # Get the model with the lowest QIC
    LowestMod=which.min(unlist(newQIC))

    # Keep the reduced model only if dropping a term lowered the QIC
    if (LowestMod != 1){
      ModelFull=newmodel[[LowestMod]]
    }
    newQIC=min(unlist(newQIC))


    #end the model selection


  }
  return(ModelFull)

}

DropVarsWalds=function(ModelFull){

  # If there are not enough covariates to select from, return a message
  if (length(attr(ModelFull$terms,"term.labels"))<2){
    NewModel='No Covariates to select from'

  }else{


    OldModel=ModelFull
    # Get the anova values
    temp=anova(ModelFull)

    # Make n models with selection
    while(length(which(temp$`P(>|Chi|)`>.05))>0 & is.data.frame(temp)){


      # get the maximum value
      dropvar=rownames(temp)[which.max(temp$`P(>|Chi|)`)]

      # new formula for the full model
      newform <- as.formula(paste(".~.-",dropvar))

      # new full model
      ModelFull= update(ModelFull,newform) 

      # Get the model covariate names
      terms <- attr(ModelFull$terms,"term.labels")

      # Get the anova values (anova can fail after terms are dropped, so catch errors)
      temp=tryCatch({anova(ModelFull)}, error=function(e){e})


    }

    NewModel=ModelFull
  }

  return(NewModel)
}


# functions to produce partial plots (largely based on MRSea, Scott-Hayward)
partialDF=function(mod, data, Variable){

  coefpos <- c(1, grep(Variable, colnames(model.matrix(mod))))
  xvals <- data[, which(names(data) == Variable)]
  newX <- seq(min(xvals), max(xvals), length = 500)
  eval(parse(text = paste(Variable, "<- newX", 
                          sep = "")))
  response <- rep(1, 500)
  newBasis <- eval(parse(text = labels(terms(mod))[grep(Variable, 
                                                        labels(terms(mod)))]))
  partialfit <- cbind(rep(1, 500), newBasis) %*% coef(mod)[coefpos]
  rcoefs <- NULL
  try(rcoefs <- rmvnorm(1000, coef(mod), summary(mod)$cov.scaled), 
      silent = T)
  if (is.null(rcoefs) || length(which(is.na(rcoefs) == 
                                      T)) > 0) {
    rcoefs <- rmvnorm(1000, coef(mod), as.matrix(nearPD(summary(mod)$cov.scaled)$mat))
  }
  rpreds <- cbind(rep(1, 500), newBasis) %*% t(rcoefs[, 
                                                      coefpos])
  quant.func <- function(x) {
    quantile(x, probs = c(0.025, 0.975))
  }
  cis <- t(apply(rpreds, 1, quant.func))

  partialfit <- mod$family$linkinv(partialfit)
  cis <- mod$family$linkinv(cis)

  fitdf=data.frame(x=newX, y=partialfit, LCI=cis[,1], UCI=cis[,2]) 
  colnames(fitdf)[1]=Variable
  return(fitdf)
}

partialdf_factor=function(mod, data, variable){
  coeffac <- c(1,grep(variable, colnames(model.matrix(mod))))
  coefradial <- c(grep("LocalRadialFunction", colnames(model.matrix(mod))))
  coefpos <- coeffac[which(is.na(match(coeffac, coefradial)))]
  xvals <- data[, which(names(data) == variable)]
  newX <- sort(unique(xvals))
  newX <- newX[2:length(newX)]
  partialfit <- coef(mod)[c(coefpos)]
  rcoefs <- NULL
  try(rcoefs <- rmvnorm(1000, coef(mod), summary(mod)$cov.scaled), 
      silent = T)
  if (is.null(rcoefs) || length(which(is.na(rcoefs) == T)) > 0) {
    rcoefs <- rmvnorm(1000, coef(mod), as.matrix(nearPD(summary(mod)$cov.scaled)$mat))
  }

  if((length(coefpos))>1){

    rpreds <- as.data.frame(rcoefs[, c(coefpos)])
    fitdf_year=data.frame(vals=rpreds[,1])
    fitdf_year$FactorVariable=as.factor(paste(variable, levels(xvals)[1], sep = ''))

    #############################
    # Recompile for plotting    #
    #############################


    for(jj in 2:ncol(rpreds)){
      temp=data.frame(vals=rpreds[,jj])
      temp$FactorVariable=as.factor(colnames(rpreds)[jj])
      fitdf_year=rbind(fitdf_year, temp)
      rm(temp)
    }

  }else{

    rpreds <- rcoefs[,coefpos]
    fitdf_year=data.frame(vals=rpreds)
    fitdf_year$FactorVariable=colnames(model.matrix(mod))[coefpos[2:length(coefpos)]]

  }

  fitdf=fitdf_year
  colnames(fitdf)[2]=variable

  return(fitdf)


}


#Reshape your data

temp=data.frame(flowering)
flowering=data.frame(y=temp[,1], spp=colnames(temp)[1])

for(ii in 2:ncol(temp)){

  flowering=rbind(flowering, data.frame(y=temp[,ii], spp=colnames(temp)[ii]))

}

flowering$rain=rep(rain, ncol(temp))
# In the simulated data sp1-sp5 have low wood density and sp6-sp10 high,
# so the first five species are 'light' and the last five 'dense'
flowering$woods=as.factor(c(rep('light',nrow(temp)*5), rep('dense',nrow(temp)*5)))

fulmod=geeglm(formula=y~bs(rain)+woods, 
              family = gaussian, 
              id=spp, 
              data=flowering, 
              corstr = 'ar1')

# Use stepwise QIC to select the best model
QICmod=SelectModel(fulmod)

# Use repeated Wald tests to retain significant variables
sig_mod=DropVarsWalds(ModelFull = QICmod)
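
The partial-plot helpers above are defined but never called; a minimal sketch of how they might be used on the fitted model (assuming bs(rain) and woods both survive selection -- adjust to whatever remains in your own run) could look like this:

# Hypothetical usage of the partial-plot helpers defined above
rain_fit=partialDF(sig_mod, flowering, "rain")

# Partial effect of rain on flowering with simulation-based 95% CIs
plot(rain_fit$rain, rain_fit$y, type='l', xlab='Rain', ylab='Flowering')
lines(rain_fit$rain, rain_fit$LCI, lty=2)
lines(rain_fit$rain, rain_fit$UCI, lty=2)

# For the factor covariate, boxplots of the simulated coefficient draws
woods_fit=partialdf_factor(sig_mod, flowering, "woods")
boxplot(vals~woods, data=woods_fit, ylab='Coefficient (link scale)')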

Upvotes: 0

Ben Bolker

Reputation: 227061

I might actually suggest you try this on http://stats.stackexchange.com, or on the r-sig-ecology mailing list. It's a little bit of a can of worms. The fundamental problem is that it's hard to prove that the association of two time series is causal: especially when both fluctuate regularly over time, it's easy for them simply to be fluctuating at about the same period and hence to appear to line up (the classic examples of this are the sunspot cycle and various totally unrelated time series such as the New York stock exchange).

The classical approach would be to "whiten" each of the time series independently (you could fit a periodic spline or a sinusoidal model, or just take differences of the pattern from the seasonal average) until each one is indistinguishable from white noise, and then examine the cross-correlations among the series (at lag zero, i.e. ordinary correlations, or at other lags to capture a leading/following pattern). In your case you would presumably then want to see how the cross-correlations vary in turn with wood density.
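
As a rough illustration of that idea (a sketch only: the "whitening" here is crude 12-month seasonal differencing, and with only two years of data the cross-correlations will be very noisy):

#Whiten each series by seasonal differencing, take the lag-0
#cross-correlation of each species with the whitened rain series,
#and relate that correlation to wood density
rain.w<-diff(rain,lag=12)
cc0<-sapply(1:ncol(flowering),function(i){
  sp.w<-diff(as.numeric(flowering[,i]),lag=12)
  cc<-ccf(sp.w,rain.w,plot=FALSE)
  cc$acf[which(cc$lag==0)]
})
plot(wden,cc0,xlab="Wood density",ylab="Lag-0 cross-correlation with rain")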

Alternatively, you could just lump the data into "rainy season" and "dry season" and do a more standard analysis (you get rid of most of the correlation by lumping), but (1) it would be nice to have an a priori division of seasons rather than doing it by looking at the data; (2) you might lose some power and/or interesting fine-scale patterns this way; (3) some of the basic inferential problems are still there -- is flowering associated with rain, or just with the rainy season?
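
A minimal sketch of the lumping approach (here the seasons are thresholded from the observed rain, which is exactly what point (1) above cautions against; an a priori calendar definition would be better):

#Lump months into seasons and compare mean flowering per species
season<-factor(ifelse(rain>=100,"rainy","dry"))
seasmean<-apply(flowering,2,function(y) tapply(y,season,mean))
#difference in mean flowering (rainy - dry) against wood density
plot(wden,seasmean["rainy",]-seasmean["dry",],
     xlab="Wood density",ylab="Rainy - dry difference in mean flowering")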

Nice example, though.

Upvotes: 2
