Ph.D.Student

Reputation: 726

Negative binomial glm.nb : Different results with unbalanced variable and complete separation

I am fitting a negative binomial model with glm.nb from the MASS package in R, with Y as the response and y1, x, and z as predictors. The problem is that I get different estimated coefficients from the same model when I use the following equivalent formulas: Y ~ y1 + x * z or Y ~ x * z + y1 (i.e. with the term order reversed).

Below are my data and the results with the two formulas:

z <-c("T3", "T3", "T3", "T2", "T2", "T3", "T3", "T2", "T2", "T2", "T3", "T3", "T3", "T3", "T2", "T3", "T2", "T2", "T1", "T3", "T3", "T2", "T2", "T2", "T2", "T2", "T2", "T3", "T2", "T3", "T2", "T2", "T2", "T2", "T3", "T3", "T2", "T2", "T3", "T3", "T3", "T3", "T3", "T3", "T3", "T2", "T3", "T2", "T3", "T3", "T3", "T2", "T1", "T3", "T3", "T3", "T3", "T2", "T3", "T3", "T2", "T2", "T2", "T2", "T2", "T3", "T2", "T3", "T2", "T3", "T3", "T3", "T3", "T2", "T2", "T2", "T2", "T2", "T2", "T3", "T3", "T3", "T3", "T3", "T3", "T2", "T2", "T2", "T2", "T2", "T3", "T3", "T3", "T3", "T3", "T2", "T3", "T2", "T3", "T2", "T2", "T2", "T2", "T3", "T3", "T3", "T2", "T2", "T3", "T2", "T3", "T3", "T2", "T2")
x <- c("B-", "B-", "B+", "B+", "B-", "B-", "B+", "B-", "B-", "B-", "B-", "B-", "B-", "B-", "B+", "B-", "B-", "B-", "B+", "B-", "B+", "B+", "B-", "B-", "B+", "B-", "B+", "B+", "B-", "B-", "B+", "B+", "B+", "B-", "B-", "B-", "B-", "B-", "B+", "B+", "B-", "B-", "B+", "B+", "B+", "B-", "B-", "B-", "B-", "B-", "B+", "B-", "B-", "B+", "B+", "B+", "B-", "B-", "B-", "B-", "B-", "B+", "B-", "B+", "B+", "B+", "B-", "B-", "B+", "B-", "B-", "B-", "B-", "B-", "B-", "B+", "B-", "B-", "B-", "B-", "B-", "B+", "B+", "B+", "B-", "B+", "B+", "B+", "B-", "B-", "B+", "B-", "B-", "B-", "B-", "B-", "B+", "B-", "B-", "B-", "B-", "B-", "B+", "B+", "B-", "B-", "B+", "B+", "B-", "B+", "B+", "B-", "B-", NA )
Y <- c(0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 13, 19, 1, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 1, 0, 2, 0, 0, 6, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 10, 0, 0, 0, 0, 0, 0, 3)
y1 <- c(1, 3, 0, 1, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 3, 0, 0, 1, 0, 5, 0, 2, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 1, 12, 1, 0, 0, 0, 2, 0, 4, 0, 0, 0, 0, 4, 0, 0, 0, 2, 2, 0, 0, 1, 0, 0, 16, 0, 14, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 6, 0, 0, 1, 0, 0, 1, 13)

library(MASS)  # provides glm.nb

data1 <- data.frame(Y, y1, z, x)
glm.nb(Y ~ x * z + y1, data = data1)
glm.nb(Y ~ y1 + x * z, data = data1)

> glm.nb(Y ~  x * z + y1, data = data1)

Call:  glm.nb(formula = Y ~ x * z + y1, data = data1, init.theta = 0.2911311528, 
    link = log)

Coefficients:
(Intercept)           x1           z1           z2           y1        x1:z1        x1:z2  
  -13.27592     -0.06054    -24.03075     11.75385      0.38529      0.54648     -0.11488  

Degrees of Freedom: 112 Total (i.e. Null);  106 Residual
  (1 observation deleted due to missingness)
Null Deviance:      94.45 
Residual Deviance: 61.74    AIC: 199.8

> glm.nb(Y ~  y1 + x * z, data = data1)

Call:  glm.nb(formula = Y ~ y1 + x * z, data = data1, init.theta = 0.2911311528, 
    link = log)

Coefficients:
(Intercept)           y1           x1           z1           z2        x1:z1        x1:z2  
   -13.4950       0.3853      -0.7217     -24.4689      11.9729      -0.7759       0.5463  

Degrees of Freedom: 112 Total (i.e. Null);  106 Residual
  (1 observation deleted due to missingness)
Null Deviance:      94.45 
Residual Deviance: 61.74    AIC: 199.8

I get different coefficient values from the two formulas. Given that the variable z has only two observations at the T1 level, could this be influencing the estimation of my coefficients?

Is it possible to have identical results with the two formulas Y ~ y1 + x * z and Y ~ x * z + y1 even if there is a convergence problem?

Thank you for your answers and your help.

Upvotes: 3

Views: 366

Answers (1)

Ben Bolker

Reputation: 226007

This is admittedly a weird result — I would have expected some sort of warning. However, I think in the end you shouldn't worry about this difference, for reasons I will explain (tl;dr this is an optimistic fit to a small, noisy data set, and none of the interesting results differ between models).
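(For reference, the two fits compared below can be set up as follows; m1 and m2 are just shorthand for the two formula orderings from the question.)

library(MASS)
m1 <- glm.nb(Y ~ x * z + y1, data = data1)  # first ordering
m2 <- glm.nb(Y ~ y1 + x * z, data = data1)  # second ordering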

  • all.equal(logLik(m1), logLik(m2)) says that the models have equally good fit (up to numerical tolerance)
  • all.equal(predict(m1), predict(m2)) says that the predictions are not identical — off by 2% on average. However:
bad <- which(abs(predict(m1)-predict(m2))>1e-3)
predict(m1)[bad]
##        19        53 
## -34.88559 -36.20873 
predict(m2)[bad]
##        19        53 
## -35.99086 -39.19723 

These predictions are on the log scale, so the predicted values are tiny, and the differences between the predictions are tiny on the count scale (e.g. exp(-35) is 6.3e-16 and exp(-36) is 2.3e-16).
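One way to confirm that the discrepancy is negligible on the count scale (the explicit tolerance check below is my addition, not part of the original answer):

max(abs(predict(m1, type = "response") - predict(m2, type = "response")))
## effectively zero: the fitted counts from the two orderings are indistinguishable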

The confidence intervals for all coefficients except the effect of y1 are ridiculously large (and all of the p-values are 1):

library(dotwhisker)
dwplot(list(m1,m2))

[dot-and-whisker plot of the coefficient estimates and confidence intervals for m1 and m2]

(If we set ci = NA to turn off the confidence intervals we could see the differences in the coefficients.)
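A sketch of what that call might look like (following the note above):

dwplot(list(m1, m2), ci = NA)  # hide the huge CIs so the small coefficient differences are visible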

This is an example of complete separation — not only are there only 2 observations in the T1 condition, but they're both zero ...
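A quick way to check this in the data (the cross-tabulation is my addition):

with(data1, table(z, Y > 0))  # the T1 row contains only zero responses, which is what drives the separation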

It would be wiser to omit the T1 condition entirely; it's just messing things up. In that case (m1A <- update(m1, data = subset(data1, z != "T1")), and similarly for m2, as sketched below) you still have a significant effect of y1, and the two term orders agree.
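A minimal sketch of that refit (m2A and the coefficient comparison are my additions):

m1A <- update(m1, data = subset(data1, z != "T1"))
m2A <- update(m2, data = subset(data1, z != "T1"))
all.equal(coef(m1A)[names(coef(m2A))], coef(m2A))  # same estimates once the terms are put in the same order
summary(m1A)  # the y1 effect remains significant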

Upvotes: 2
