CharlieB
CharlieB

Reputation: 1

Why am I getting a value of 1 for Pr(>|z|) in the output of my glm?

I am trying to fit a generalized linear model to my data. I collected Pollen from the air during two (unequal) time intervals: day (9 hours) and night (15 hours) and identified the pollen as Cherry, Other angiosperm, and Other gymnosperm, based on morphological characteristics.

When I look at the raw data there are some interesting-looking trends in the amount of Cherry pollen in the air at night compared to during the day, and these trends differ quite a bit between the two orchards I sampled. I want to measure how the time interval impacts my model.

My problems (non-exhaustive list :p)

1) I do not understand why the Interval term keeps giving me a value of 1 for Pr(>|z|). 2) I don't know how to account for the unequal time intervals.

I have tried this glm with two error distributions, poisson and binomial, depending on the format of the response variable.

Here is a sub-sample of my data:

Orchard <- c("CSO", "CSO", "CSO", "HBA", "HBA", "HBA")
Interval <- c("AM", "AM", "AM", "PM", "PM", "PM")
Interval.Duration <- c(9,9,9,15,15,15)
PollenType <- c("Cherry", "Other angiosperm", "Other gymnosperm")
Count <- c(0,2,11,245,124,5,0,2,19,80,38,0,1,0,3,200,150,1)
TotalCount <- c(13,13,13,374,374,374,21,21,21,118,118,118,4,4,4,351,351,351)

df <- data.frame(Orchard, Interval, Interval.Duration, PollenType, Count, TotalCount)
df

# Poisson error distribution model
mod <- glm(Count ~ PollenType + Interval + offset(log(TotalCount)), data = df, family = poisson)
summary(mod) 

Call:
glm(formula = Count ~ PollenType + Interval + offset(log(TotalCount)), 
    family = poisson, data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-5.0076  -3.0242  -0.9525   1.3507   8.8612  

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)    
(Intercept)                -5.158e-01  1.646e-01  -3.134  0.00172 ** 
PollenTypeOther angiosperm -5.096e-01  7.117e-02  -7.159  8.1e-13 ***
PollenTypeOther gymnosperm -2.602e+00  1.660e-01 -15.677  < 2e-16 ***
IntervalPM                 -1.532e-14  1.658e-01   0.000  1.00000    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 742.63  on 17  degrees of freedom
Residual deviance: 240.61  on 14  degrees of freedom
AIC: 313.04

Number of Fisher Scoring iterations: 7


# Binomial error distribution model
mod.bin <- glm(Count/TotalCount ~ PollenType + Interval, data = df, family = binomial)
summary(mod.bin)


Call:
glm(formula = Count/TotalCount ~ PollenType + Interval, family = binomial, 
    data = df)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.04298  -0.88408   0.03032   0.56505   1.02296  

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)
(Intercept)                -5.805e-01  9.912e-01  -0.586    0.558
PollenTypeOther angiosperm -6.754e-01  1.300e+00  -0.519    0.603
PollenTypeOther gymnosperm  2.558e-01  1.187e+00   0.216    0.829
IntervalPM                  7.973e-14  1.016e+00   0.000    1.000

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9.7044  on 17  degrees of freedom
Residual deviance: 9.1325  on 14  degrees of freedom
AIC: 28.299

Number of Fisher Scoring iterations: 4

The output of 1 for Interval tells me that I've done something wrong, or that I'm not asking my question properly in the glm.

I suspect the problem could have something to do with how my data set is formatted, but I really don't know where to start.

~~ EDIT ~~

Thanks @Eric-Scott for the detailed feedback. I reworked the dummy data to reflect only Cherry counts, and made subsets for each orchard because I don’t think it makes sense to include data from both orchards in the same model. I used the "Interval.Duration" as the offset as you recommended.

Orchard <- c("HBA", "CSO")
Interval <- c("AM", "AM", "PM","PM")
Interval.Duration <- c(9,15)
Count <- c(13,4,245,0,80,2,98,1,200,1,530,1,196,2,311,1)
TotalCount <- c(38,7,374,21,118,15,144,4,351,10,884,12,338,34,490,15)

df <- data.frame(Orchard, Interval, Interval.Duration, Count, TotalCount)
df
df.H <- subset(df, Orchard == "HBA")
df.C <- subset(df, Orchard == "CSO")

#HBA data
modH <- glm(Count ~ Interval + offset(log(Interval.Duration)),  data = df.H, family = quasipoisson)

Call:
glm(formula = Count ~ Interval + offset(log(Interval.Duration)), 
    family = quasipoisson, data = df.H)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-13.392   -6.225   -1.096    6.204   12.226  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.6088     0.4263   6.120 0.000869 ***
IntervalPM    0.8843     0.5067   1.745 0.131574    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 88.85893)

    Null deviance: 892.38  on 7  degrees of freedom
Residual deviance: 594.73  on 6  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

#CSO data
modC <- glm(Count ~ Interval + offset(log(Interval.Duration)), data = df.C, family = quasipoisson)

Call:
glm(formula = Count ~ Interval + offset(log(Interval.Duration)), 
    family = quasipoisson, data = df.C)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.22474  -0.36170   0.05231   0.27453   1.05020  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.8971     0.2400  -7.904 0.000218 ***
IntervalPM   -1.0986     0.4801  -2.289 0.062070 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 0.5185185)

    Null deviance: 6.9044  on 7  degrees of freedom
Residual deviance: 3.7649  on 6  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

Using the dummy data, I can see overdispersion and underdispersion in the “HBA” and “CSO” models, respectively (although when I run the model on the full data set, there is overdispersion in both).

First, I fit a model with a Negative Binomial distribution, since I know that a zero-inflated model is not going to help with the overdispersion in the HBA data set (there aren't any zeros).

library(MASS)
mod.NB <- glm.nb(formula = Count ~ Interval + offset(log(Interval.Duration)), data = df)
summary(mod.NB)

Call:
glm.nb(formula = Count ~ Interval + offset(log(Interval.Duration)), 
    data = df, init.theta = 0.2822584211, link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9557  -1.4501  -0.8920   0.3307   0.8569  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)   1.9258     0.6667   2.889  0.00387 **
IntervalPM    0.8753     0.9423   0.929  0.35294   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(0.2823) family taken to be 1)

    Null deviance: 21.300  on 15  degrees of freedom
Residual deviance: 20.463  on 14  degrees of freedom
AIC: 166.75

Number of Fisher Scoring iterations: 1


              Theta:  0.2823 
          Std. Err.:  0.0836 

 2 x log-likelihood:  -160.7510

Following the advice to exponentiate the coefficients from https://rpubs.com/kaz_yos/pscl-2 I "exponentiated" (is that a word?) the coefficients from my model:

exp(coef(mod.NB))

(Intercept)  IntervalPM 
   6.860548    2.399691 

If I interpret this output correctly, the average number of Cherry pollen grains is 6.86, and at night, there is 2.40 times more Cherry pollen. Statistically speaking however, the increase in pollen at night is not statistically greater than during the day. (NB: Bear with me, most of my confusion stems from interpreting outputs.)

I ran a zero-inflated model to deal with the mess of zeros in the "CSO" data set, but it is not clear to me how to interpret the results here either.

library(pscl)
#HBA data set
modzH <- zeroinfl(formula = Count ~ Interval, data = df.H, dist = "negbin")

There aren't any zeros in the HBA data set, so the following error pops up (hence the above NegBin glm).

Error in zeroinfl(formula = Count ~ Interval, data = df.H, dist = "negbin") : 

invalid dependent variable, minimum count is not zero

#CSO data
modzC <- zeroinfl(formula = Count ~ Interval, data = df.C, dist = "negbin")
summary(modzC)

Call:
zeroinfl(formula = Count ~ Interval + offset(log(Interval.Duration)), data = df.C, 
    dist = "negbin")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-0.8660 -0.3333  0.0610  0.2887  1.1666 

Count model coefficients (negbin with log link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.8971     0.3333  -5.691 1.26e-08 ***
IntervalPM   -1.0986     0.6667  -1.648   0.0994 .  
Log(theta)   13.6976   510.5736   0.027   0.9786    

Zero-inflation model coefficients (binomial with logit link):
             Estimate Std. Error z value Pr(>|z|)
(Intercept)    -28.49  198393.43       0        1
IntervalPM      12.25  198394.44       0        1
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta = 888750.384 
Number of iterations in BFGS optimization: 23 
Log-likelihood: -10.13 on 5 Df

On the advice of Ferran Paüls Vergés I again, exponentiated the coefficients from the model. (Is this because the model output gives coefficients in log-odds?)

## Exponentiated coefficients
expCoef <- exp(coef((modzC)))
expCoef

# count_(Intercept)  count_IntervalPM  zero_(Intercept)   zero_IntervalPM 
# 1.500015e-01      3.333271e-01      4.234422e-13      2.092359e+05 

And (if I understand how to apply these) the baseline number of Cherry pollen in a sample that has pollen in it is 1.500015e-01 and the baseline odds of a sample having zero Cherry pollen grains is 4.234422e-13? I'm really uncertain about this sentence, it doesn't seem to make sense.

Upvotes: 0

Views: 1987

Answers (1)

anon
anon

Reputation:

Your Count variable is definitely overdispersed, which means the poisson GLM may be giving you misleading results. I can see the overdispersion in a few ways: first, the residual deviance is a lot larger than degrees of freedom, second, if you run the model as a quasipoisson (https://www.theanalysisfactor.com/glm-r-overdispersion-count-regression/) the dispersion parameter is 46, which is a lot larger than 1, which is assumed for a poisson distribution, and plotting a histogram gives a distribution with a long tail. You have a lot of zeroes and you may benefit from using a zero inflated model. Zero inflated models assume that the zeroes you got came from two sources: detection zeroes and "real" zeroes. That is, maybe there really wasn't any cherry pollen in that first observation, or maybe it was out there and you just didn't detect it.

I think you're on the right track with the first model. Offsets are usually used to account for differences in sampling effort. To me, the most natural candidate for the offset is interval.length, but I can understand why you might use TotalCount.

Looking at a boxplot, I think it seems reasonable that there is no main effect of interval (p = 1), but if you include the interaction between interval and pollen type, the interaction is significant. If you are only interested in Cherry pollen and you're just using the other pollen types to get the TotalPollen for the offset, then filter the df so it's only cherry and drop PollenType from the model and I bet you'll see an effect of interval.

mod <- glm(Count ~ PollenType + Interval, offset = log(TotalCount), data = df, family = poisson)
car::Anova(mod) 

mod2 <- glm(Count ~ PollenType*Interval, offset = log(TotalCount), data = df, family = quasipoisson)
car::Anova(mod2) 

But the first thing I would do is try dealing with the overdispersion.


Response to Edit

First off, you said you'd do separate models for each orchard, but you didn't (data = df). If you do use data = df then you can include orchard as a factor and you get a significant interval:orchard interaction, which is maybe interesting to you? (try looking at ggplot(df, aes(x = Interval, y = log(Count), color = Orchard)) + geom_boxplot()).

If you separate them, you'll have less power overall, but if you're not interested in differences between orchards, it sounds like you're justified in doing that. You could also use all the data and treat orchard as a random effect in a glmer model, but that's adding another layer of complexity.

I'd try and keep things simple by applying the same sort of model to both orchards. The fact that there are no zeros in one of your orchards makes me think that there isn't really two sources of zeroes in the other orchard. That is, the zeros you have are "real" and indicate an absence of pollen, not just a failure to detect pollen. So maybe not zero inflated after all? You'll have to judge for yourself if you have more zeros than you'd expect under a negative binomial distribution. Plus, as you noticed, interpreting zero-inflated models is notoriously difficult.

As for interpreting the outputs of the negative binomial glm, I'd personally avoid looking at summary() too much. You can use it to get coefficients (or you can use coef()), but to test for the significance of Interval, use either car::Anova(mod.NB) or create a null model (mod.NB.0 <- glm.nb(formula = Count ~ 1 + offset(log(Interval.Duration)), data = df)) and use lmtest::lrtest(mod.NB, mod.NB.0).

Upvotes: 0

Related Questions