Reputation: 1
I am trying to fit a generalized linear model to my data. I collected Pollen from the air during two (unequal) time intervals: day (9 hours) and night (15 hours) and identified the pollen as Cherry, Other angiosperm, and Other gymnosperm, based on morphological characteristics.
When I look at the raw data there are some interesting-looking trends in the amount of Cherry pollen in the air at night compared to during the day, and these trends differ quite a bit between the two orchards I sampled. I want to measure how the time interval impacts my model.
My problems (non-exhaustive list :p)
1) I do not understand why the Interval term keeps giving me a value of 1 for Pr(>|z|). 2) I don't know how to account for the unequal time intervals.
I have tried this glm with two error distributions, poisson and binomial, depending on the format of the response variable.
Here is a sub-sample of my data:
Orchard <- c("CSO", "CSO", "CSO", "HBA", "HBA", "HBA")
Interval <- c("AM", "AM", "AM", "PM", "PM", "PM")
Interval.Duration <- c(9,9,9,15,15,15)
PollenType <- c("Cherry", "Other angiosperm", "Other gymnosperm")
Count <- c(0,2,11,245,124,5,0,2,19,80,38,0,1,0,3,200,150,1)
TotalCount <- c(13,13,13,374,374,374,21,21,21,118,118,118,4,4,4,351,351,351)
df <- data.frame(Orchard, Interval, Interval.Duration, PollenType, Count, TotalCount)
df
# Poisson error distribution model
mod <- glm(Count ~ PollenType + Interval + offset(log(TotalCount)), data = df, family = poisson)
summary(mod)
Call:
glm(formula = Count ~ PollenType + Interval + offset(log(TotalCount)),
family = poisson, data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-5.0076 -3.0242 -0.9525 1.3507 8.8612
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.158e-01 1.646e-01 -3.134 0.00172 **
PollenTypeOther angiosperm -5.096e-01 7.117e-02 -7.159 8.1e-13 ***
PollenTypeOther gymnosperm -2.602e+00 1.660e-01 -15.677 < 2e-16 ***
IntervalPM -1.532e-14 1.658e-01 0.000 1.00000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 742.63 on 17 degrees of freedom
Residual deviance: 240.61 on 14 degrees of freedom
AIC: 313.04
Number of Fisher Scoring iterations: 7
# Binomial error distribution model
mod.bin <- glm(Count/TotalCount ~ PollenType + Interval, data = df, family = binomial)
summary(mod.bin)
Call:
glm(formula = Count/TotalCount ~ PollenType + Interval, family = binomial,
data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.04298 -0.88408 0.03032 0.56505 1.02296
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.805e-01 9.912e-01 -0.586 0.558
PollenTypeOther angiosperm -6.754e-01 1.300e+00 -0.519 0.603
PollenTypeOther gymnosperm 2.558e-01 1.187e+00 0.216 0.829
IntervalPM 7.973e-14 1.016e+00 0.000 1.000
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 9.7044 on 17 degrees of freedom
Residual deviance: 9.1325 on 14 degrees of freedom
AIC: 28.299
Number of Fisher Scoring iterations: 4
The output of 1 for Interval tells me that I've done something wrong, or that I'm not asking my question properly in the glm.
I suspect the problem could have something to do with how my data set is formatted, but I really don't know where to start.
~~ EDIT ~~
Thanks @Eric-Scott for the detailed feedback. I reworked the dummy data to reflect only Cherry counts, and made subsets for each orchard because I don’t think it makes sense to include data from both orchards in the same model. I used the "Interval.Duration" as the offset as you recommended.
Orchard <- c("HBA", "CSO")
Interval <- c("AM", "AM", "PM","PM")
Interval.Duration <- c(9,15)
Count <- c(13,4,245,0,80,2,98,1,200,1,530,1,196,2,311,1)
TotalCount <- c(38,7,374,21,118,15,144,4,351,10,884,12,338,34,490,15)
df <- data.frame(Orchard, Interval, Interval.Duration, Count, TotalCount)
df
df.H <- subset(df, Orchard == "HBA")
df.C <- subset(df, Orchard == "CSO")
#HBA data
modH <- glm(Count ~ Interval + offset(log(Interval.Duration)), data = df.H, family = quasipoisson)
Call:
glm(formula = Count ~ Interval + offset(log(Interval.Duration)),
family = quasipoisson, data = df.H)
Deviance Residuals:
Min 1Q Median 3Q Max
-13.392 -6.225 -1.096 6.204 12.226
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.6088 0.4263 6.120 0.000869 ***
IntervalPM 0.8843 0.5067 1.745 0.131574
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasipoisson family taken to be 88.85893)
Null deviance: 892.38 on 7 degrees of freedom
Residual deviance: 594.73 on 6 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 5
#CSO data
modC <- glm(Count ~ Interval + offset(log(Interval.Duration)), data = df.C, family = quasipoisson)
Call:
glm(formula = Count ~ Interval + offset(log(Interval.Duration)),
family = quasipoisson, data = df.C)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.22474 -0.36170 0.05231 0.27453 1.05020
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.8971 0.2400 -7.904 0.000218 ***
IntervalPM -1.0986 0.4801 -2.289 0.062070 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasipoisson family taken to be 0.5185185)
Null deviance: 6.9044 on 7 degrees of freedom
Residual deviance: 3.7649 on 6 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 5
Using the dummy data, I can see overdispersion and underdispersion in the “HBA” and “CSO” models, respectively (although when I run the model on the full data set, there is overdispersion in both).
First, I fit a model with a Negative Binomial distribution, since I know that a zero-inflated model is not going to help with the overdispersion in the HBA data set (there aren't any zeros).
library(MASS)
mod.NB <- glm.nb(formula = Count ~ Interval + offset(log(Interval.Duration)), data = df)
summary(mod.NB)
Call:
glm.nb(formula = Count ~ Interval + offset(log(Interval.Duration)),
data = df, init.theta = 0.2822584211, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9557 -1.4501 -0.8920 0.3307 0.8569
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.9258 0.6667 2.889 0.00387 **
IntervalPM 0.8753 0.9423 0.929 0.35294
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(0.2823) family taken to be 1)
Null deviance: 21.300 on 15 degrees of freedom
Residual deviance: 20.463 on 14 degrees of freedom
AIC: 166.75
Number of Fisher Scoring iterations: 1
Theta: 0.2823
Std. Err.: 0.0836
2 x log-likelihood: -160.7510
Following the advice to exponentiate the coefficients from https://rpubs.com/kaz_yos/pscl-2 I "exponentiated" (is that a word?) the coefficients from my model:
exp(coef(mod.NB))
(Intercept) IntervalPM
6.860548 2.399691
If I interpret this output correctly, the average number of Cherry pollen grains is 6.86, and at night, there is 2.40 times more Cherry pollen. Statistically speaking however, the increase in pollen at night is not statistically greater than during the day. (NB: Bear with me, most of my confusion stems from interpreting outputs.)
I ran a zero-inflated model to deal with the mess of zeros in the "CSO" data set, but it is not clear to me how to interpret the results here either.
library(pscl)
#HBA data set
modzH <- zeroinfl(formula = Count ~ Interval, data = df.H, dist = "negbin")
There aren't any zeros in the HBA data set, so the following error pops up (hence the above NegBin glm).
Error in zeroinfl(formula = Count ~ Interval, data = df.H, dist = "negbin") :
invalid dependent variable, minimum count is not zero
#CSO data
modzC <- zeroinfl(formula = Count ~ Interval, data = df.C, dist = "negbin")
summary(modzC)
Call:
zeroinfl(formula = Count ~ Interval + offset(log(Interval.Duration)), data = df.C,
dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-0.8660 -0.3333 0.0610 0.2887 1.1666
Count model coefficients (negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.8971 0.3333 -5.691 1.26e-08 ***
IntervalPM -1.0986 0.6667 -1.648 0.0994 .
Log(theta) 13.6976 510.5736 0.027 0.9786
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -28.49 198393.43 0 1
IntervalPM 12.25 198394.44 0 1
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta = 888750.384
Number of iterations in BFGS optimization: 23
Log-likelihood: -10.13 on 5 Df
On the advice of Ferran Paüls Vergés I again, exponentiated the coefficients from the model. (Is this because the model output gives coefficients in log-odds?)
## Exponentiated coefficients
expCoef <- exp(coef((modzC)))
expCoef
# count_(Intercept) count_IntervalPM zero_(Intercept) zero_IntervalPM
# 1.500015e-01 3.333271e-01 4.234422e-13 2.092359e+05
And (if I understand how to apply these) the baseline number of Cherry pollen in a sample that has pollen in it is 1.500015e-01 and the baseline odds of a sample having zero Cherry pollen grains is 4.234422e-13? I'm really uncertain about this sentence, it doesn't seem to make sense.
Upvotes: 0
Views: 1987
Reputation:
Your Count variable is definitely overdispersed, which means the poisson GLM may be giving you misleading results. I can see the overdispersion in a few ways: first, the residual deviance is a lot larger than degrees of freedom, second, if you run the model as a quasipoisson (https://www.theanalysisfactor.com/glm-r-overdispersion-count-regression/) the dispersion parameter is 46, which is a lot larger than 1, which is assumed for a poisson distribution, and plotting a histogram gives a distribution with a long tail. You have a lot of zeroes and you may benefit from using a zero inflated model. Zero inflated models assume that the zeroes you got came from two sources: detection zeroes and "real" zeroes. That is, maybe there really wasn't any cherry pollen in that first observation, or maybe it was out there and you just didn't detect it.
I think you're on the right track with the first model. Offsets are usually used to account for differences in sampling effort. To me, the most natural candidate for the offset is interval.length
, but I can understand why you might use TotalCount
.
Looking at a boxplot, I think it seems reasonable that there is no main effect of interval (p = 1), but if you include the interaction between interval and pollen type, the interaction is significant. If you are only interested in Cherry pollen and you're just using the other pollen types to get the TotalPollen for the offset, then filter the df so it's only cherry and drop PollenType
from the model and I bet you'll see an effect of interval.
mod <- glm(Count ~ PollenType + Interval, offset = log(TotalCount), data = df, family = poisson)
car::Anova(mod)
mod2 <- glm(Count ~ PollenType*Interval, offset = log(TotalCount), data = df, family = quasipoisson)
car::Anova(mod2)
But the first thing I would do is try dealing with the overdispersion.
Response to Edit
First off, you said you'd do separate models for each orchard, but you didn't (data = df
). If you do use data = df
then you can include orchard as a factor and you get a significant interval:orchard interaction, which is maybe interesting to you? (try looking at ggplot(df, aes(x = Interval, y = log(Count), color = Orchard)) + geom_boxplot()
).
If you separate them, you'll have less power overall, but if you're not interested in differences between orchards, it sounds like you're justified in doing that. You could also use all the data and treat orchard as a random effect in a glmer model, but that's adding another layer of complexity.
I'd try and keep things simple by applying the same sort of model to both orchards. The fact that there are no zeros in one of your orchards makes me think that there isn't really two sources of zeroes in the other orchard. That is, the zeros you have are "real" and indicate an absence of pollen, not just a failure to detect pollen. So maybe not zero inflated after all? You'll have to judge for yourself if you have more zeros than you'd expect under a negative binomial distribution. Plus, as you noticed, interpreting zero-inflated models is notoriously difficult.
As for interpreting the outputs of the negative binomial glm, I'd personally avoid looking at summary()
too much. You can use it to get coefficients (or you can use coef()
), but to test for the significance of Interval, use either car::Anova(mod.NB)
or create a null model (mod.NB.0 <- glm.nb(formula = Count ~ 1 + offset(log(Interval.Duration)), data = df)
) and use lmtest::lrtest(mod.NB, mod.NB.0)
.
Upvotes: 0