Reputation: 726
I am fitting a negative binomial model with glm.nb from the MASS package in R, with the response Y depending on the variables y1, x, and z.
The problem is that I get different estimated coefficients from the same model when I use the following equivalent formulas (the same terms, just in a different order):
Y ~ y1 + x * z
Y ~ x * z + y1
Below are my data and the results with the two formulas:
z <-c("T3", "T3", "T3", "T2", "T2", "T3", "T3", "T2", "T2", "T2", "T3", "T3", "T3", "T3", "T2", "T3", "T2", "T2", "T1", "T3", "T3", "T2", "T2", "T2", "T2", "T2", "T2", "T3", "T2", "T3", "T2", "T2", "T2", "T2", "T3", "T3", "T2", "T2", "T3", "T3", "T3", "T3", "T3", "T3", "T3", "T2", "T3", "T2", "T3", "T3", "T3", "T2", "T1", "T3", "T3", "T3", "T3", "T2", "T3", "T3", "T2", "T2", "T2", "T2", "T2", "T3", "T2", "T3", "T2", "T3", "T3", "T3", "T3", "T2", "T2", "T2", "T2", "T2", "T2", "T3", "T3", "T3", "T3", "T3", "T3", "T2", "T2", "T2", "T2", "T2", "T3", "T3", "T3", "T3", "T3", "T2", "T3", "T2", "T3", "T2", "T2", "T2", "T2", "T3", "T3", "T3", "T2", "T2", "T3", "T2", "T3", "T3", "T2", "T2")
x <- c("B-", "B-", "B+", "B+", "B-", "B-", "B+", "B-", "B-", "B-", "B-", "B-", "B-", "B-", "B+", "B-", "B-", "B-", "B+", "B-", "B+", "B+", "B-", "B-", "B+", "B-", "B+", "B+", "B-", "B-", "B+", "B+", "B+", "B-", "B-", "B-", "B-", "B-", "B+", "B+", "B-", "B-", "B+", "B+", "B+", "B-", "B-", "B-", "B-", "B-", "B+", "B-", "B-", "B+", "B+", "B+", "B-", "B-", "B-", "B-", "B-", "B+", "B-", "B+", "B+", "B+", "B-", "B-", "B+", "B-", "B-", "B-", "B-", "B-", "B-", "B+", "B-", "B-", "B-", "B-", "B-", "B+", "B+", "B+", "B-", "B+", "B+", "B+", "B-", "B-", "B+", "B-", "B-", "B-", "B-", "B-", "B+", "B-", "B-", "B-", "B-", "B-", "B+", "B+", "B-", "B-", "B+", "B+", "B-", "B+", "B+", "B-", "B-", NA )
Y <- c(0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 13, 19, 1, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 1, 0, 2, 0, 0, 6, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 10, 0, 0, 0, 0, 0, 0, 3)
y1 <- c(1, 3, 0, 1, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 3, 0, 0, 1, 0, 5, 0, 2, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 1, 12, 1, 0, 0, 0, 2, 0, 4, 0, 0, 0, 0, 4, 0, 0, 0, 2, 2, 0, 0, 1, 0, 0, 16, 0, 14, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 6, 0, 0, 1, 0, 0, 1, 13)
data1 <- data.frame(Y, y1, z, x)
glm.nb(Y ~ x * z + y1, data = data1)
glm.nb(Y ~ y1 + x * z, data = data1)
> glm.nb(Y ~ x * z + y1, data = data1)
Call: glm.nb(formula = Y ~ x * z + y1, data = data1, init.theta = 0.2911311528,
link = log)
Coefficients:
(Intercept) x1 z1 z2 y1 x1:z1 x1:z2
-13.27592 -0.06054 -24.03075 11.75385 0.38529 0.54648 -0.11488
Degrees of Freedom: 112 Total (i.e. Null); 106 Residual
(1 observation deleted due to missingness)
Null Deviance: 94.45
Residual Deviance: 61.74 AIC: 199.8
> glm.nb(Y ~ y1 + x * z, data = data1)
Call: glm.nb(formula = Y ~ y1 + x * z, data = data1, init.theta = 0.2911311528,
link = log)
Coefficients:
(Intercept) y1 x1 z1 z2 x1:z1 x1:z2
-13.4950 0.3853 -0.7217 -24.4689 11.9729 -0.7759 0.5463
Degrees of Freedom: 112 Total (i.e. Null); 106 Residual
(1 observation deleted due to missingness)
Null Deviance: 94.45
Residual Deviance: 61.74 AIC: 199.8
I get different coefficient values from the two formulas. Given that the variable z has only two observations at the T1 level, could this influence the estimation of my coefficients?
Is it possible to obtain identical results from the two formulas Y ~ y1 + x * z and Y ~ x * z + y1 even if there is a convergence problem?
Thank you for your answers and your help.
Upvotes: 3
Views: 366
Reputation: 226007
This is admittedly a weird result — I would have expected some sort of warning. However, I think in the end you shouldn't worry about this difference, for reasons I will explain (tl;dr this is an optimistic fit to a small, noisy data set, and none of the interesting results differ between models).
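In what follows I'll refer to the two fits as m1 and m2. A minimal setup (the object names are mine; I'm assuming m1 uses the first formula from the question and m2 the second):
library(MASS)
m1 <- glm.nb(Y ~ x * z + y1, data = data1)
m2 <- glm.nb(Y ~ y1 + x * z, data = data1)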
all.equal(logLik(m1), logLik(m2))
says that the models have equally good fit (up to numerical tolerance).
all.equal(predict(m1), predict(m2))
says that the predictions are not identical (off by 2% on average). However:
bad <- which(abs(predict(m1) - predict(m2)) > 1e-3)
predict(m1)[bad]
##        19        53
## -34.88559 -36.20873
predict(m2)[bad]
##        19        53
## -35.99086 -39.19723
These predictions are on the log scale, so the predicted values are tiny, and the differences between the predictions are tiny on the count scale (e.g. exp(-35) is 6.3e-16 and exp(-36) is 2.3e-16).
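To make this concrete (a quick check of my own, reusing the bad indices found above), back-transforming those two predictions to the count scale shows negligible differences:
exp(predict(m1)[bad]) - exp(predict(m2)[bad])
## differences on the order of 1e-16, i.e. effectively zero counts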
The confidence intervals for all coefficients except the effect of y1 are ridiculously large (and all of the p-values are 1):
library(dotwhisker)
dwplot(list(m1, m2))
(If we set ci = NA to turn off the confidence intervals we could see the differences in the coefficients.)
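If you want to see those coefficient differences numerically rather than graphically, one option (my own code, not part of the original answer) is to line the estimates up by name, since the two formulas order the terms differently:
cbind(m1 = coef(m1), m2 = coef(m2)[names(coef(m1))])   # side-by-side estimates, matched by term name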
This is an example of complete separation: not only are there only 2 observations in the T1 condition, but both of their responses are zero ...
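You can confirm this with a quick cross-tabulation (my own check):
with(data1, table(z, Y > 0))   # the T1 row has 2 observations, both with Y == 0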
It would be wiser to omit the T1 condition entirely; it's just messing things up. In that case (m1A <- update(m1, data = subset(data1, z != "T1")), and similarly for m2A) you still have a significant effect of y1, and the two term orders agree.
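A sketch of that refit and the agreement check (the object names follow the answer; the comparison lines are my own, and the exact numbers will depend on your session):
dat_noT1 <- subset(data1, z != "T1")
m1A <- update(m1, data = dat_noT1)
m2A <- update(m2, data = dat_noT1)
all.equal(coef(m1A)[names(coef(m2A))], coef(m2A))   # should now be TRUE (up to tolerance)
summary(m1A)$coefficients["y1", ]                   # y1 remains a significant predictor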
Upvotes: 2