Reputation: 423
I'm running a simple ANOVA, but when Year is a factor, the p-value doesn't show up. If I code Year as numeric, the p-value shows up. I really want Year to be a factor instead of a continuous variable like a date.
dat <- structure(list(Year = structure(1:26, levels = c("1994", "1995",
"1996", "1997", "1998", "1999", "2001", "2002", "2003", "2004",
"2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012",
"2013", "2014", "2015", "2016", "2017", "2018", "2019", "2021"
), class = "factor"), no = c(1, 1, 0, 1, 2, 3, 4, 14, 28, 0,
2, 2, 6, 1, 0, 1, 0, 0, 0, 2, 5, 0, 4, 3, 0, 0), yes = c(3L,
5L, 2L, 1L, 8L, 16L, 30L, 19L, 23L, 2L, 40L, 16L, 23L, 11L, 2L,
5L, 9L, 1L, 2L, 12L, 7L, 5L, 8L, 10L, 6L, 5L), percentage = c(75,
83.3333333333333, 100, 50, 80, 84.2105263157895, 88.2352941176471,
57.5757575757576, 45.0980392156863, 100, 95.2380952380952, 88.8888888888889,
79.3103448275862, 91.6666666666667, 100, 83.3333333333333, 100,
100, 100, 85.7142857142857, 58.3333333333333, 100, 66.6666666666667,
76.9230769230769, 100, 100), total = c(4, 6, 2, 2, 10, 19, 34,
33, 51, 2, 42, 18, 29, 12, 2, 6, 9, 1, 2, 14, 12, 5, 12, 13,
6, 5)), class = "data.frame", row.names = c(NA, -26L))
The data set above is with Year as a factor. Here is my output:
summary(aov(percentage ~ Year, data = dat)) # not significant
# Df Sum Sq Mean Sq
#Year 25 7030 281.2
Any ideas would help!
Upvotes: 0
Views: 276
Reputation: 73325
Since this is biological data, it's important to treat each year as individualized data, so we figured a factor would be the best way.
head(dat)
# Year no yes percentage total
#1 1994 1 3 75.00000 4
#2 1995 1 5 83.33333 6
#3 1996 0 2 100.00000 2
#4 1997 1 1 50.00000 2
#5 1998 2 8 80.00000 10
#6 1999 3 16 84.21053 19
There are plenty of issues here. I will talk about:

1. the linear model percentage ~ Year with lm/aov;
2. the logistic regression cbind(yes, no) ~ Year with glm.

lm/aov
Statistically speaking, it is certainly not a good idea to model your manually calculated percentage (of "yes") with a linear regression. But there is an additional problem here: you only have one observation per year, so it is certainly not reasonable to treat Year as a factor. That way you have as many regression coefficients as data points, so you get a perfect fit with all residuals equal to 0. As a result, all test statistics and p-values are NA or NaN.
lmfit <- aov(percentage ~ Year, data = dat)
## use summary.lm() on an aov() fit to show the coefficient table
summary.lm(lmfit)
#Call:
#aov(formula = percentage ~ Year, data = dat)
#
#Residuals:
#ALL 26 residuals are 0: no residual degrees of freedom! ## dang!!!
#
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 75.000 NaN NaN NaN
#Year1995 8.333 NaN NaN NaN
#Year1996 25.000 NaN NaN NaN
#Year1997 -25.000 NaN NaN NaN
#Year1998 5.000 NaN NaN NaN
#Year1999 9.211 NaN NaN NaN
#... trimmed; all NaN ...
#
#Residual standard error: NaN on 0 degrees of freedom
#Multiple R-squared: 1, Adjusted R-squared: NaN
#F-statistic: NaN on 25 and 0 DF, p-value: NA
anova(lmfit)
#Analysis of Variance Table
#
#Response: percentage
# Df Sum Sq Mean Sq F value Pr(>F)
#Year 25 7029.5 281.18 NaN NaN
#Residuals 0 0.0 NaN
#Warning message:
#In anova.lm(lmfit) :
# ANOVA F-tests on an essentially perfect fit are unreliable ## dang!!!
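The NA/NaN values disappear as soon as Year enters as a single numeric slope instead of 25 dummy coefficients, because residual degrees of freedom remain. A minimal toy illustration (made-up numbers, not the OP's data):

```r
## Toy data: one percentage per year, as in the question
toy <- data.frame(
  Year       = factor(2001:2006),
  percentage = c(75, 80, 100, 50, 90, 85)
)

## Year as factor: 6 coefficients for 6 rows -> 0 residual df, no F-test
fit_factor <- aov(percentage ~ Year, data = toy)
df.residual(fit_factor)
#[1] 0

## Year as numeric: intercept + slope -> 4 residual df, the F-test appears
fit_numeric <- aov(percentage ~ as.numeric(as.character(Year)), data = toy)
df.residual(fit_numeric)
#[1] 4
```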
glm
In principle, what we want is the logistic regression below.
glmfit1 <- glm(cbind(yes, no) ~ Year, family = binomial(), data = dat)
However, there is still only one observation per Year, so you still get a perfect fit. In this case, the deviance residuals are all 0.
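You can verify this saturation directly on a toy example (made-up counts, not the OP's data):

```r
## Toy grouped data: one (yes, no) pair per year, no zero counts
toy <- data.frame(
  Year = factor(2001:2004),
  yes  = c(3, 5, 2, 8),
  no   = c(1, 2, 2, 1)
)
fit <- glm(cbind(yes, no) ~ Year, family = binomial(), data = toy)

## As many coefficients as rows: a saturated model
df.residual(fit)
#[1] 0

## Fitted probabilities reproduce the observed proportions,
## so every deviance residual is (numerically) 0
max(abs(residuals(fit, type = "deviance")))
```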
anova(glmfit1, test = "Chisq")
#Analysis of Deviance Table
#
#Model: binomial, link: logit
#
#Response: cbind(yes, no)
#
#Terms added sequentially (first to last)
#
#
# Df Deviance Resid. Df Resid. Dev Pr(>Chi)
#NULL 25 70.107
#Year 25 70.107 0 0.000 3.71e-06 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
I am not sure whether the p-value is valid here, given that we have 0 residual degrees of freedom (i.e., a perfect fit).
One workaround (or maybe a cheat) is to transform your data into binary format, with one row per individual observation.
Year_no <- with(dat, rep(Year, no))
Year_yes <- with(dat, rep(Year, yes))
fctr <- rep(c("no", "yes"), c(length(Year_no), length(Year_yes)))
fctr <- factor(fctr, levels = c("no", "yes"))
Year <- c(Year_no, Year_yes)
cheat <- data.frame(fctr = fctr, Year = Year)
rm(Year_no, Year_yes, fctr, Year)
head(cheat)
# fctr Year
#1 no 1994
#2 no 1995
#3 no 1997
#4 no 1998
#5 no 1998
#6 no 1999
In this format, the sample size rises to 351, allowing you to treat Year as a factor without ending up with a perfect fit.
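It is worth a sanity check that this expansion reproduces the grouped counts. Using a hypothetical mini dataset of the same shape (not the OP's data):

```r
## Hypothetical grouped counts
grp <- data.frame(Year = factor(2001:2003), no = c(1, 0, 2), yes = c(3, 2, 1))

## Same expansion as above: one row per individual
Year_no  <- with(grp, rep(Year, no))
Year_yes <- with(grp, rep(Year, yes))
fctr <- factor(rep(c("no", "yes"), c(length(Year_no), length(Year_yes))),
               levels = c("no", "yes"))
long <- data.frame(fctr = fctr, Year = c(Year_no, Year_yes))

## Row count equals the total number of trials (1+0+2+3+2+1 = 9)
nrow(long)
#[1] 9

## Per-year counts match the grouped data
with(long, table(Year, fctr))
```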
glmfit2 <- glm(fctr ~ Year, family = binomial(), data = cheat)
I thought these two specifications of glm should give the same fit, but actually they don't.
all.equal(glmfit1$coef, glmfit2$coef)
#[1] "Mean relative difference: 0.3250444"
Note that both model fits have converged.
glmfit1$converged
#[1] TRUE
glmfit2$converged
#[1] TRUE
So I think this is really weird (could make a good question on https://stats.stackexchange.com/). But anyway, here is the ANOVA table.
anova(glmfit2, test = "Chisq")
#Analysis of Deviance Table
#
#Model: binomial, link: logit
#
#Response: fctr
#
#Terms added sequentially (first to last)
#
#
# Df Deviance Resid. Df Resid. Dev Pr(>Chi)
#NULL 350 376.80
#Year 25 70.107 325 306.69 3.71e-06 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In this case, there are 325 residual degrees of freedom. I tend to trust this result.
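As for why glmfit1 and glmfit2 disagree on coefficients: my (unverified) suspicion is quasi-complete separation. Several years are 100% "yes", so their logit coefficients drift toward infinity, and the two fits may simply stop iterating at different points. With made-up counts containing no zeros, the grouped and binary parameterizations do agree:

```r
## Toy grouped data with no zero counts (no separation)
grp <- data.frame(Year = factor(2001:2003), yes = c(3, 5, 2), no = c(1, 2, 2))
fit_grouped <- glm(cbind(yes, no) ~ Year, family = binomial(), data = grp)

## Equivalent binary version: one row per individual, "no" rows first
long <- data.frame(
  Year = with(grp, c(rep(Year, no), rep(Year, yes))),
  fctr = factor(rep(c("no", "yes"), c(sum(grp$no), sum(grp$yes))),
                levels = c("no", "yes"))
)
fit_binary <- glm(fctr ~ Year, family = binomial(), data = long)

## Both parameterizations reach the same maximum-likelihood estimates
all.equal(coef(fit_grouped), coef(fit_binary), tolerance = 1e-5)
```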
This is as far as I can help on Stack Overflow. Your next step should be to seek interpretation of these two logistic regressions on https://stats.stackexchange.com/. Please share the question link with me after you post it there.
Upvotes: 2