Cassidy

Reputation: 423

AOV functioning differently when using a date versus a factor

I'm running a simple ANOVA, but when Year is a factor, the p-value doesn't show up. If I code Year as numeric, the p-value appears. I really want Year to be a factor rather than a continuous variable like a date.

dat <- structure(list(Year = structure(1:26, levels = c("1994", "1995", 
"1996", "1997", "1998", "1999", "2001", "2002", "2003", "2004", 
"2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", 
"2013", "2014", "2015", "2016", "2017", "2018", "2019", "2021"
), class = "factor"), no = c(1, 1, 0, 1, 2, 3, 4, 14, 28, 0, 
2, 2, 6, 1, 0, 1, 0, 0, 0, 2, 5, 0, 4, 3, 0, 0), yes = c(3L, 
5L, 2L, 1L, 8L, 16L, 30L, 19L, 23L, 2L, 40L, 16L, 23L, 11L, 2L, 
5L, 9L, 1L, 2L, 12L, 7L, 5L, 8L, 10L, 6L, 5L), percentage = c(75, 
83.3333333333333, 100, 50, 80, 84.2105263157895, 88.2352941176471, 
57.5757575757576, 45.0980392156863, 100, 95.2380952380952, 88.8888888888889, 
79.3103448275862, 91.6666666666667, 100, 83.3333333333333, 100, 
100, 100, 85.7142857142857, 58.3333333333333, 100, 66.6666666666667, 
76.9230769230769, 100, 100), total = c(4, 6, 2, 2, 10, 19, 34, 
33, 51, 2, 42, 18, 29, 12, 2, 6, 9, 1, 2, 14, 12, 5, 12, 13, 
6, 5)), class = "data.frame", row.names = c(NA, -26L))

The data set above is with Year as a factor. Here is my output:

summary(aov(percentage ~ Year, data = dat)) # not significant
#            Df Sum Sq Mean Sq
#Year        25   7030   281.2

Any ideas would help!

Upvotes: 0

Views: 276

Answers (1)

Zheyuan Li

Reputation: 73325

You mention that since this is biological data, you want to treat each year as individualized data, which is why a factor seemed like the best choice.

head(dat)
#  Year no yes percentage total
#1 1994  1   3   75.00000     4
#2 1995  1   5   83.33333     6
#3 1996  0   2  100.00000     2
#4 1997  1   1   50.00000     2
#5 1998  2   8   80.00000    10
#6 1999  3  16   84.21053    19

There are plenty of issues here. I will talk about:

  • linear model percentage ~ Year with lm/aov;

  • logistic regression cbind(yes, no) ~ Year with glm.

What's wrong with using lm/aov

Statistically speaking, it is not a good idea to model your manually calculated percentage (of "yes") with a linear regression. But there is an additional problem here: you only have one observation per year, so it is not reasonable to treat Year as a factor. Doing so gives you as many regression coefficients as data points, so you get a perfect fit with all residuals equal to 0. As a result, all test statistics and p-values are NA or NaN.

lmfit <- aov(percentage ~ Year, data = dat)

## use `summary.lm()` on the aov() fit to show the coefficient table
summary.lm(lmfit)
#Call:
#aov(formula = percentage ~ Year, data = dat)
#
#Residuals:
#ALL 26 residuals are 0: no residual degrees of freedom!  ## dang!!!
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept)   75.000        NaN     NaN      NaN
#Year1995       8.333        NaN     NaN      NaN
#Year1996      25.000        NaN     NaN      NaN
#Year1997     -25.000        NaN     NaN      NaN
#Year1998       5.000        NaN     NaN      NaN
#Year1999       9.211        NaN     NaN      NaN
#... trimmed; all NaN ...
#
#Residual standard error: NaN on 0 degrees of freedom
#Multiple R-squared:      1,    Adjusted R-squared:    NaN 
#F-statistic:   NaN on 25 and 0 DF,  p-value: NA

anova(lmfit)
#Analysis of Variance Table
#
#Response: percentage
#          Df Sum Sq Mean Sq F value Pr(>F)
#Year      25 7029.5  281.18     NaN    NaN
#Residuals  0    0.0     NaN               
#Warning message:
#In anova.lm(lmfit) :
#  ANOVA F-tests on an essentially perfect fit are unreliable  ## dang!!!
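For contrast, this is presumably why the p-value does show up when you code Year as numeric: a linear trend spends only one degree of freedom, leaving 24 residual degrees of freedom for the F-test. A minimal sketch (Year_num is a helper column I am adding here, not one of yours):

## recover the numeric year from the factor via its labels, not its integer codes
dat$Year_num <- as.numeric(as.character(dat$Year))
anova(lm(percentage ~ Year_num, data = dat))  # 1 df for Year_num, 24 residual df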

Switching to logistic regression with glm

In principle, what we want is the logistic regression below.

glmfit1 <- glm(cbind(yes, no) ~ Year, family = binomial(), data = dat)
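As an aside, an equivalent way to write the same model, using only the columns already in dat, is the proportion-plus-weights form (a sketch, not something you need):

## equivalent specification: success proportion as response, weighted by trial counts
glm(I(percentage / 100) ~ Year, family = binomial(), weights = total, data = dat)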

However, there is still only one observation per Year, so you still get a perfect fit. In this case, the deviance residuals are all 0.
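You can confirm the perfect fit directly:

df.residual(glmfit1)  # 26 observations, 26 coefficients
#[1] 0
range(residuals(glmfit1, type = "deviance"))  # all (numerically) zero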

anova(glmfit1, test = "Chisq")
#Analysis of Deviance Table
#
#Model: binomial, link: logit
#
#Response: cbind(yes, no)
#
#Terms added sequentially (first to last)
#
#
#     Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
#NULL                    25     70.107             
#Year 25   70.107         0      0.000 3.71e-06 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

I am not sure whether the p-value is valid here, given that we have 0 residual degrees of freedom (i.e., a perfect fit).

One workaround (or maybe a cheat) is to transform your data to binary format, with one row per observation.

## expand the aggregated counts into one row per observation
Year_no  <- with(dat, rep(Year, no))   # one row per "no"
Year_yes <- with(dat, rep(Year, yes))  # one row per "yes"
fctr <- rep(c("no", "yes"), c(length(Year_no), length(Year_yes)))
fctr <- factor(fctr, levels = c("no", "yes"))
Year <- c(Year_no, Year_yes)  # c() keeps factor levels in R >= 4.1
cheat <- data.frame(fctr = fctr, Year = Year)
rm(Year_no, Year_yes, fctr, Year)
head(cheat)
#  fctr Year
#1   no 1994
#2   no 1995
#3   no 1997
#4   no 1998
#5   no 1998
#6   no 1999

In this format, the sample size rises to 351 (the sum of the total column), so you can treat Year as a factor without ending up with a perfect fit.

glmfit2 <- glm(fctr ~ Year, family = binomial(), data = cheat)
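As a quick sanity check (a sketch, using only objects defined above), the expanded data should reproduce the original counts:

tab <- table(cheat$Year, cheat$fctr)
all(tab[, "no"] == dat$no, tab[, "yes"] == dat$yes)  # should be TRUE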

I thought these two specifications of glm should give the same fit, but they actually don't.

all.equal(glmfit1$coef, glmfit2$coef)
#[1] "Mean relative difference: 0.3250444"

Note that both model fits have converged.

glmfit1$converged
#[1] TRUE

glmfit2$converged
#[1] TRUE
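One possible explanation, which is my guess rather than something I have verified: the years with no = 0 are fitted with probability exactly 1, a form of quasi-complete separation, so their coefficients drift toward +Inf and the values reported depend on exactly where the IRLS iterations stop. A rough probe:

## years whose outcomes are all "yes"; their fitted probabilities are exactly 1
as.character(dat$Year[dat$no == 0])
## for those levels, expect very large coefficients with huge standard errors
coef(glmfit1)[abs(coef(glmfit1)) > 10]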

So this is still somewhat weird (it could make a good question on https://stats.stackexchange.com/). But anyway, here is the ANOVA table.

anova(glmfit2, test = "Chisq")
#Analysis of Deviance Table
#
#Model: binomial, link: logit
#
#Response: fctr
#
#Terms added sequentially (first to last)
#
#
#     Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
#NULL                   350     376.80             
#Year 25   70.107       325     306.69 3.71e-06 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

In this case, the residual degrees of freedom are 325. I tend to trust this result.

Closing Remark

This is as far as I can help on Stack Overflow. Your next step should be to seek an interpretation of these two logistic regressions on https://stats.stackexchange.com/. Please share the question link with me after you post it there.

Upvotes: 2
