I am trying to fit a GLM to a count dataset, and have found that my data are overdispersed and therefore not suitable for a Poisson GLM. I am aware that I should use a negative binomial GLM instead, which involves a theta (dispersion) parameter. However, when I try to run a summary of my model I get the series of errors shown below, and it can't find the theta value. Any help on this would be appreciated. Below I give a summary of my dataset, the code used to produce a summary of the model, and the errors.
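For reference, the overdispersion check I ran was along these lines (a minimal sketch; the exact code may have differed, and m0 is just an illustrative name for a Poisson fit of the same model):

m0 <- glm(Total ~ Treatment, data = twohour, family = poisson)
## ratio of the sum of squared Pearson residuals to the residual degrees
## of freedom; values much larger than 1 indicate overdispersion
sum(residuals(m0, type = "pearson")^2) / df.residual(m0)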
Dataset summary:
The variables used in the GLM are Total (count data) and Treatment (a letter representing the different treatments, e.g. C, M, F).
Code used to fit the model and produce the summary (glm.nb is from MASS):

library(MASS)
summary(m1 <- glm.nb(Total ~ Treatment, data = twohour))
Output of this code with errors at the bottom:
Any help with producing a theta value would be greatly appreciated. Thanks in advance.
As requested, the summary and model outputs as text:
summary:
> summary(twohour)
  Treatment            Length                ID            Block1          Block2         Fertility          Notes               Total      
 Length:252         Length:252         Min.   : 1.00   Min.   :  0.0   Min.   :  0.00   Min.   :0.0000   Length:252         Min.   :  0.0  
 Class :character   Class :character   1st Qu.:10.00   1st Qu.:125.8   1st Qu.: 39.50   1st Qu.:1.0000   Class :character   1st Qu.:172.2  
 Mode  :character   Mode  :character   Median :19.50   Median :154.0   Median :104.50   Median :1.0000   Mode  :character   Median :263.0  
                                       Mean   :19.89   Mean   :143.5   Mean   : 94.66   Mean   :0.9683                      Mean   :238.1  
                                       3rd Qu.:30.00   3rd Qu.:179.2   3rd Qu.:146.00   3rd Qu.:1.0000                      3rd Qu.:309.5  
                                       Max.   :40.00   Max.   :227.0   Max.   :228.00   Max.   :1.0000                      Max.   :434.0  
model output:
Call:
glm.nb(formula = Total ~ Treatment, data = twohour, init.theta = 2055605.705, 
    link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-23.001   -4.624    1.650    4.567   12.571  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   5.577987   0.009846 566.534  < 2e-16 ***
TreatmentC   -0.102625   0.014394  -7.130 1.01e-12 ***
TreatmentF   -0.154580   0.014396 -10.737  < 2e-16 ***
TreatmentF30 -0.298972   0.019920 -15.008  < 2e-16 ***
TreatmentM   -0.158733   0.014613 -10.862  < 2e-16 ***
TreatmentM30 -0.044795   0.013992  -3.201  0.00137 ** 
TreatmentMxF -0.105191   0.014211  -7.402 1.34e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(2055606) family taken to be 1)

    Null deviance: 15127  on 251  degrees of freedom
Residual deviance: 14799  on 245  degrees of freedom
AIC: 16542

Number of Fisher Scoring iterations: 1

Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L, : 
  invalid 'nsmall' argument
In addition: Warning messages:
1: In theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace > :
  iteration limit reached
2: In sqrt(1/i) : NaNs produced
3: In theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace > :
  iteration limit reached
4: In sqrt(1/i) : NaNs produced
tl;dr I suspect this is driven by outliers, especially some zero values that look inconsistent with the rest of the data set. If the zero values aren't errors or weird cases, you might consider a zero-inflated model ...
We probably need your data to know for sure what's going on. Here's what I can gather so far:
I can get most of the way by constructing a data set that's mostly Poisson, but with a few extreme outliers:
library(MASS)

n  <- 252    ## total number of obs
ng <- 7      ## number of groups/treatments
mu <- exp(6) ## mean response
## NOTE: this doesn't match your data, I did it accidentally,
## but it does reproduce the errors.
set.seed(101)
dd <- data.frame(
    ## mostly Poisson, but with 2 values at the min and max values
    y = c(rpois(n - 4, lambda = mu), rep(c(0, 434), each = 2)),
    f = factor(rep(1:ng, length.out = n))
)
summary(dd)
m2 <- glm.nb(y ~ f, data = dd)
(the zero values look like the largest problem. I can reproduce the problem with 2 (but not 1) zero outliers, with the rest of the data Poisson with a large mean ...)
Warning messages:
1: In theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace > :
iteration limit reached
2: In sqrt(1/i) : NaNs produced
3: In theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace > :
iteration limit reached
4: In sqrt(1/i) : NaNs produced
Results:
Call:
glm.nb(formula = y ~ f, data = dd, init.theta = 12474197.56,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-28.1363 -0.4887 0.0444 0.7153 3.4771
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.0005206 0.0082958 723.319 < 2e-16 ***
f2 0.0006192 0.0117302 0.053 0.95790
f3 0.0015129 0.0117276 0.129 0.89736
f4 -0.0328793 0.0118297 -2.779 0.00545 **
f5 -0.0195274 0.0117898 -1.656 0.09766 .
f6 0.0068583 0.0117120 0.586 0.55816
f7 -0.0087784 0.0117579 -0.747 0.45531
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(12474198) family taken to be 1)
Null deviance: 1837.2 on 251 degrees of freedom
Residual deviance: 1820.0 on 245 degrees of freedom
AIC: 3795.6
Number of Fisher Scoring iterations: 1
Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L, : invalid 'nsmall' argument
A little digging shows that this particular error arises because the standard error of the theta estimate can't be computed (it's NaN) ...
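You can confirm this directly from the fitted object (glm.nb stores the theta estimate and its standard error as the components theta and SE.theta):

m2$theta     ## enormous (~1.2e7): the data aren't really overdispersed
m2$SE.theta  ## NaN: this is what trips up the summary printing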
Looking at the diagnostics (plot(m2)) shows the outliers clearly.
The following works fine (more or less: it gives ridiculous theta estimates, because the data are not overdispersed once zero-inflation is accounted for):
library(pscl)
zeroinfl(y ~ f, dist = "negbin", data = dd)
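And if the zeros turn out to be errors rather than real observations, a simple alternative (sketched here; whether it's appropriate depends on where the zeros came from) is to drop them and refit:

m3 <- glm.nb(y ~ f, data = subset(dd, y > 0))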