Francis Smart
Francis Smart

Reputation: 4055

R: Simple anova with MANY factor values

I am attempting a simple anova with many factor values 7000+. Upon attempting to use the aov command, my computer is getting extremely bogged down (never finishing). I don't understand why this should be the case since manually attempting to program my own anova seems to run much quicker. The challenge though is that my anova is diverging from the base anova on small sample sizes and I am not sure why exactly (see below).

My question to request if anyone has options for a faster anova. Also, if someone is paricularly inspired I have attached my "anova". If anybody is so inspired, I would greatly appreciate any feedback to see if I made mistakes on writing it. My anova seems to run 10-100x faster than the base. But then again perhaps I am making some mistakes.

anova <- function(X, ID, na.rm=TRUE) {
  N <- sum(!is.na(X))
  K <- length(unique(ID))

  Xbari <- ave(X, ID, FUN=function(x) mean(x, na.rm=na.rm))
  Xbar  <- mean(X, na.rm=na.rm)
  ESS <- sum((Xbar-Xbari)^2, na.rm=na.rm)
  USS <- sum((X-Xbari)^2,   na.rm=na.rm)

  TDF <- (K-1)
  DFi <- (N-K)
  fstat = (ESS/TDF)/(USS/DFi)

  data.frame(p=pf(fstat, TDF, DFi),ESS=ESS,
             USS=USS,TDF=TDF, DFi=DFi, fstat=fstat)
}

Upvotes: 0

Views: 234

Answers (1)

Roland
Roland

Reputation: 132576

aov is basically a wrapper for lm. One of the first steps lm does is creating the design matrix, i.e., transforming your factor variable into dummy variables, i.e., your factor of length n is transformed to a matrix of dimensions n * 7000. A least squares fit with more than 7000 variables can be expected to be slow.

The calculation of the p-value if not correct in your anova function. Try this:

anova <- function(X, ID, na.rm=TRUE) {
  stopifnot(length(X) == length(ID))
  nas <- is.na(X) | is.na(ID)
  stopifnot(!any(nas) & na.rm)
  X <- X[!nas]
  ID <- ID[!nas]

  N <- length(X)
  K <- length(unique(ID))

  Xbari <- ave(X, ID, FUN=mean)

  Xbar  <- mean(X)
  ESS <- sum((Xbar-Xbari)^2)
  USS <- sum((X-Xbari)^2)

  TDF <- (K-1)
  DFi <- (N-K)
  fstat = (ESS/TDF)/(USS/DFi)

  data.frame(p=format.pval(pf(fstat, TDF, DFi, lower.tail = FALSE)),ESS=ESS,
             USS=USS,TDF=TDF, DFi=DFi, fstat=fstat)
}

anova(npk$yield, npk$block)
#         p     ESS    USS TDF DFi    fstat
#1 0.086072 343.295 533.07   5  18 2.318386

summary(aov(yield ~ block, data = npk))
#            Df Sum Sq Mean Sq F value Pr(>F)  
#block        5  343.3   68.66   2.318 0.0861 .
#Residuals   18  533.1   29.61                 
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

This could be made faster by using data.table or dplyr to calculate the group means.

Upvotes: 1

Related Questions