Why am I getting a value of 1 for Pr(|z|) in the output of my glm?

Question

I am trying to fit a generalized linear model to my data. I collected Pollen from the air during two (unequal) time intervals: day (9 hours) and night (15 hours) and identified the pollen as Cherry, Other angiosperm, and Other gymnosperm, based on morphological characteristics.

When I look at the raw data there are some interesting-looking trends in the amount of Cherry pollen in the air at night compared to during the day, and these trends differ quite a bit between the two orchards I sampled. I want to measure how the time interval impacts my model.

My problems (non-exhaustive list :p)

1) I do not understand why the Interval term keeps giving me a value of 1 for Pr(>|z|). 2) I don't know how to account for the unequal time intervals.

I have tried this glm with two error distributions, poisson and binomial, depending on the format of the response variable.

Here is a sub-sample of my data:

Orchard <- c("CSO", "CSO", "CSO", "HBA", "HBA", "HBA")
Interval <- c("AM", "AM", "AM", "PM", "PM", "PM")
Interval.Duration <- c(9,9,9,15,15,15)
PollenType <- c("Cherry", "Other angiosperm", "Other gymnosperm")
Count <- c(0,2,11,245,124,5,0,2,19,80,38,0,1,0,3,200,150,1)
TotalCount <- c(13,13,13,374,374,374,21,21,21,118,118,118,4,4,4,351,351,351)

df <- data.frame(Orchard, Interval, Interval.Duration, PollenType, Count, TotalCount)
df

# Poisson error distribution model
mod <- glm(Count ~ PollenType + Interval + offset(log(TotalCount)), data = df, family = poisson)
summary(mod) 

Call:
glm(formula = Count ~ PollenType + Interval + offset(log(TotalCount)), 
    family = poisson, data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-5.0076  -3.0242  -0.9525   1.3507   8.8612  

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)    
(Intercept)                -5.158e-01  1.646e-01  -3.134  0.00172 ** 
PollenTypeOther angiosperm -5.096e-01  7.117e-02  -7.159  8.1e-13 ***
PollenTypeOther gymnosperm -2.602e+00  1.660e-01 -15.677  < 2e-16 ***
IntervalPM                 -1.532e-14  1.658e-01   0.000  1.00000    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 742.63  on 17  degrees of freedom
Residual deviance: 240.61  on 14  degrees of freedom
AIC: 313.04

Number of Fisher Scoring iterations: 7


# Binomial error distribution model
mod.bin <- glm(Count/TotalCount ~ PollenType + Interval, data = df, family = binomial)
summary(mod.bin)


Call:
glm(formula = Count/TotalCount ~ PollenType + Interval, family = binomial, 
    data = df)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.04298  -0.88408   0.03032   0.56505   1.02296  

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)
(Intercept)                -5.805e-01  9.912e-01  -0.586    0.558
PollenTypeOther angiosperm -6.754e-01  1.300e+00  -0.519    0.603
PollenTypeOther gymnosperm  2.558e-01  1.187e+00   0.216    0.829
IntervalPM                  7.973e-14  1.016e+00   0.000    1.000

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9.7044  on 17  degrees of freedom
Residual deviance: 9.1325  on 14  degrees of freedom
AIC: 28.299

Number of Fisher Scoring iterations: 4

The output of 1 for Interval tells me that I've done something wrong, or that I'm not asking my question properly in the glm.

I suspect the problem could have something to do with how my data set is formatted, but I really don't know where to start.

~~ EDIT ~~

Thanks @Eric-Scott for the detailed feedback. I reworked the dummy data to reflect only Cherry counts, and made subsets for each orchard because I don’t think it makes sense to include data from both orchards in the same model. I used the "Interval.Duration" as the offset as you recommended.

Orchard <- c("HBA", "CSO")
Interval <- c("AM", "AM", "PM","PM")
Interval.Duration <- c(9,15)
Count <- c(13,4,245,0,80,2,98,1,200,1,530,1,196,2,311,1)
TotalCount <- c(38,7,374,21,118,15,144,4,351,10,884,12,338,34,490,15)

df <- data.frame(Orchard, Interval, Interval.Duration, Count, TotalCount)
df
df.H <- subset(df, Orchard == "HBA")
df.C <- subset(df, Orchard == "CSO")

#HBA data
modH <- glm(Count ~ Interval + offset(log(Interval.Duration)),  data = df.H, family = quasipoisson)

Call:
glm(formula = Count ~ Interval + offset(log(Interval.Duration)), 
    family = quasipoisson, data = df.H)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-13.392   -6.225   -1.096    6.204   12.226  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.6088     0.4263   6.120 0.000869 ***
IntervalPM    0.8843     0.5067   1.745 0.131574    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 88.85893)

    Null deviance: 892.38  on 7  degrees of freedom
Residual deviance: 594.73  on 6  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

#CSO data
modC <- glm(Count ~ Interval + offset(log(Interval.Duration)), data = df.C, family = quasipoisson)

Call:
glm(formula = Count ~ Interval + offset(log(Interval.Duration)), 
    family = quasipoisson, data = df.C)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.22474  -0.36170   0.05231   0.27453   1.05020  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.8971     0.2400  -7.904 0.000218 ***
IntervalPM   -1.0986     0.4801  -2.289 0.062070 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 0.5185185)

    Null deviance: 6.9044  on 7  degrees of freedom
Residual deviance: 3.7649  on 6  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

Using the dummy data, I can see overdispersion and underdispersion in the “HBA” and “CSO” models, respectively (although when I run the model on the full data set, there is overdispersion in both).

First, I fit a model with a Negative Binomial distribution, since I know that a zero-inflated model is not going to help with the overdispersion in the HBA data set (there aren't any zeros).

library(MASS)
mod.NB <- glm.nb(formula = Count ~ Interval + offset(log(Interval.Duration)), data = df)
summary(mod.NB)

Call:
glm.nb(formula = Count ~ Interval + offset(log(Interval.Duration)), 
    data = df, init.theta = 0.2822584211, link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9557  -1.4501  -0.8920   0.3307   0.8569  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)   1.9258     0.6667   2.889  0.00387 **
IntervalPM    0.8753     0.9423   0.929  0.35294   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(0.2823) family taken to be 1)

    Null deviance: 21.300  on 15  degrees of freedom
Residual deviance: 20.463  on 14  degrees of freedom
AIC: 166.75

Number of Fisher Scoring iterations: 1


              Theta:  0.2823 
          Std. Err.:  0.0836 

 2 x log-likelihood:  -160.7510

Following the advice to exponentiate the coefficients from https://rpubs.com/kaz_yos/pscl-2 I "exponentiated" (is that a word?) the coefficients from my model:

exp(coef(mod.NB))

(Intercept)  IntervalPM 
   6.860548    2.399691

If I interpret this output correctly, the average number of Cherry pollen grains is 6.86, and at night, there is 2.40 times more Cherry pollen. Statistically speaking however, the increase in pollen at night is not statistically greater than during the day. (NB: Bear with me, most of my confusion stems from interpreting outputs.)

I ran a zero-inflated model to deal with the mess of zeros in the "CSO" data set, but it is not clear to me how to interpret the results here either.

library(pscl)
#HBA data set
modzH <- zeroinfl(formula = Count ~ Interval, data = df.H, dist = "negbin")

There aren't any zeros in the HBA data set, so the following error pops up (hence the above NegBin glm).

Error in zeroinfl(formula = Count ~ Interval, data = df.H, dist = "negbin") :

invalid dependent variable, minimum count is not zero

#CSO data
modzC <- zeroinfl(formula = Count ~ Interval, data = df.C, dist = "negbin")
summary(modzC)

Call:
zeroinfl(formula = Count ~ Interval + offset(log(Interval.Duration)), data = df.C, 
    dist = "negbin")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-0.8660 -0.3333  0.0610  0.2887  1.1666 

Count model coefficients (negbin with log link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.8971     0.3333  -5.691 1.26e-08 ***
IntervalPM   -1.0986     0.6667  -1.648   0.0994 .  
Log(theta)   13.6976   510.5736   0.027   0.9786    

Zero-inflation model coefficients (binomial with logit link):
             Estimate Std. Error z value Pr(>|z|)
(Intercept)    -28.49  198393.43       0        1
IntervalPM      12.25  198394.44       0        1
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta = 888750.384 
Number of iterations in BFGS optimization: 23 
Log-likelihood: -10.13 on 5 Df

On the advice of Ferran Paüls Vergés I again, exponentiated the coefficients from the model. (Is this because the model output gives coefficients in log-odds?)

## Exponentiated coefficients
expCoef <- exp(coef((modzC)))
expCoef

# count_(Intercept)  count_IntervalPM  zero_(Intercept)   zero_IntervalPM 
# 1.500015e-01      3.333271e-01      4.234422e-13      2.092359e+05

And (if I understand how to apply these) the baseline number of Cherry pollen in a sample that has pollen in it is 1.500015e-01 and the baseline odds of a sample having zero Cherry pollen grains is 4.234422e-13? I'm really uncertain about this sentence, it doesn't seem to make sense.

Why am I getting a value of 1 for Pr(>|z|) in the output of my glm?

Answers (1)

Related Questions

Why am I getting a value of 1 for Pr(&gt;|z|) in the output of my glm?

Answers (1)

Related Questions

Why am I getting a value of 1 for Pr(>|z|) in the output of my glm?