Reputation: 1185
I have a question about crossed effects linear mixed models in statsmodels. Specifically, I can see two ways of representing my data and I'm not sure which is appropriate. Any advice appreciated!
My data are as below. I wish to ascertain whether the objective quality of a book ('good' or 'bad') predicts the score the book is assigned. Titles are nested within the quality variable, but titles and raters are crossed. (This is fake data so I'm not worried about models converging.)
rater title quality score
john book_1 good 0.600833333
frank book_2 bad 0.683020833
emma book_3 good 0.653645833
john book_4 bad 0.6528125
frank book_5 good 0.6040625
emma book_1 good 0.600833333
john book_2 bad 0.522
frank book_3 good 0.600833333
emma book_4 bad 0.619464286
john book_5 good 0.600833333
frank book_1 good 0.57125
emma book_2 bad 0.6296875
john book_3 good 0.607205882
frank book_4 bad 0.61203125
emma book_5 good 0.600833333
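For reference, the table above can be loaded into a pandas DataFrame (using `data` as the variable name, to match the model code below):

```python
import pandas as pd

# The fake rating data from the table above.
data = pd.DataFrame({
    'rater': ['john', 'frank', 'emma'] * 5,
    'title': ['book_1', 'book_2', 'book_3', 'book_4', 'book_5'] * 3,
    'quality': ['good', 'bad', 'good', 'bad', 'good'] * 3,
    'score': [0.600833333, 0.683020833, 0.653645833, 0.6528125, 0.6040625,
              0.600833333, 0.522, 0.600833333, 0.619464286, 0.600833333,
              0.57125, 0.6296875, 0.607205882, 0.61203125, 0.600833333],
})
```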
One way to analyse this data is to take quality as my independent variable, score as my dependent variable, rater as my grouping variable, and use variance components to capture the crossed effects on title. This gives:
import statsmodels.api as sm
import statsmodels.formula.api as smf
md = smf.mixedlm('score ~ quality', vc_formula={"title": "0 + title"},
                 groups=data['rater'], data=data).fit().summary()
Model summary:
Mixed Linear Model Regression Results
===========================================================
Model: MixedLM Dependent Variable: score
No. Observations: 15 Method: REML
No. Groups: 3 Scale: 0.0007
Min. group size: 5 Log-Likelihood: 22.1997
Max. group size: 5 Converged: Yes
Mean group size: 5.0
-----------------------------------------------------------
Coef. Std.Err. z P>|z| [0.025 0.975]
-----------------------------------------------------------
Intercept 0.620 0.001 841.098 0.000 0.618 0.621
quality[T.good] -0.015 0.013 -1.158 0.247 -0.041 0.011
title Var 0.001
===========================================================
This intuitively seems to me to be the correct approach. It gives me a p-value and coefficient for my IV and accounts for the crossed effects.
However, I've been advised elsewhere that crossed effects like this should be specified by treating the dataset as one group and specifying variation entirely using variance components. Thus:
data['groups'] = 1
md = smf.mixedlm('score ~ 1',
                 vc_formula={"rater": "0 + rater", "title": "0 + title",
                             "quality": "0 + quality"},
                 groups=data['groups'], data=data).fit().summary()
Yielding:
Mixed Linear Model Regression Results
=====================================================
Model: MixedLM Dependent Variable: score
No. Observations: 15 Method: REML
No. Groups: 1 Scale: 0.0013
Min. group size: 15 Log-Likelihood: 24.4023
Max. group size: 15 Converged: No
Mean group size: 15.0
-----------------------------------------------------
Coef. Std.Err. z P>|z| [0.025 0.975]
-----------------------------------------------------
Intercept 0.612 0.017 35.011 0.000 0.577 0.646
quality Var 0.000
rater Var 0.000 0.020
title Var 0.000
=====================================================
This model offers me no p-value, different coefficients, and different model test statistics all around. Now, I'm either wrong in my use of both models, or I'm wrong in my use of one of them. Can anyone advise me which is the case? Thanks.
Upvotes: 2
Views: 2660
Reputation: 46978
You are comparing two entirely different models, which have different interpretations.
If you are interested in how quality affects the score, while treating the other variables as random intercepts, the first model is correct. I quickly checked and found this post mentioning crossed effects, and if you run the model with all your random intercepts in the variance components, you get roughly the same result:
data['group'] = 1
md2 = smf.mixedlm('score ~ quality',
                  vc_formula={"title": "0 + title", "rater": "0 + rater"},
                  groups=data['group'], data=data).fit().summary()
Model: MixedLM Dependent Variable: score
No. Observations: 15 Method: REML
No. Groups: 1 Scale: 0.0014
Min. group size: 15 Log-Likelihood: 22.0951
Max. group size: 15 Converged: No
Mean group size: 15.0
Coef. Std.Err. z P>|z| [0.025 0.975]
Intercept 0.620 0.016 38.313 0.000 0.588 0.652
quality[T.good] -0.015 0.021 -0.736 0.462 -0.056 0.026
rater Var 0.000 0.030
title Var 0.000
The only real difference is whether 'rater' appears in this table as a variance component. vc_formula is also there for specifying more complex models, e.g. random slopes, so if you only need random intercepts, I think you can use a combination of groups and variance components.
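To make this equivalence concrete, here is a quick sketch (reconstructing the question's data inline) that fits both specifications: random intercepts via groups plus vc_formula, versus everything folded into the variance components of a single dummy group. The fixed effects should agree to within numerical noise, though with data this small you should expect convergence warnings:

```python
import pandas as pd
import statsmodels.formula.api as smf

# The fake rating data from the question.
data = pd.DataFrame({
    'rater': ['john', 'frank', 'emma'] * 5,
    'title': ['book_1', 'book_2', 'book_3', 'book_4', 'book_5'] * 3,
    'quality': ['good', 'bad', 'good', 'bad', 'good'] * 3,
    'score': [0.600833333, 0.683020833, 0.653645833, 0.6528125, 0.6040625,
              0.600833333, 0.522, 0.600833333, 0.619464286, 0.600833333,
              0.57125, 0.6296875, 0.607205882, 0.61203125, 0.600833333],
})

# Specification 1: rater as the grouping variable, title as a variance component.
m1 = smf.mixedlm('score ~ quality', vc_formula={'title': '0 + title'},
                 groups=data['rater'], data=data).fit()

# Specification 2: one dummy group, both random intercepts as variance components.
data['one'] = 1
m2 = smf.mixedlm('score ~ quality',
                 vc_formula={'title': '0 + title', 'rater': '0 + rater'},
                 groups=data['one'], data=data).fit()

# Compare the fixed-effect estimates from the two parameterisations.
print(m1.fe_params)
print(m2.fe_params)
```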
In the second model, you are modeling score with only an intercept, independent of quality, which doesn't make sense if you are interested in the effect of quality.
Lastly, we can check against the results from R:
df = structure(list(rater = structure(c(3L, 2L, 1L, 3L, 2L, 1L, 3L,
2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L), .Label = c("emma", "frank",
"john"), class = "factor"), title = structure(c(1L, 2L, 3L, 4L,
5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L), .Label = c("book_1",
"book_2", "book_3", "book_4", "book_5"), class = "factor"), quality = structure(c(2L,
1L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L), .Label = c("bad",
"good"), class = "factor"), score = c(0.600833333, 0.683020833,
0.653645833, 0.6528125, 0.6040625, 0.600833333, 0.522, 0.600833333,
0.619464286, 0.600833333, 0.57125, 0.6296875, 0.607205882, 0.61203125,
0.600833333)), class = "data.frame", row.names = c(NA, -15L))
library(lme4)
summary(lmer(score ~ quality + (1|rater) + (1|title),data=df))
boundary (singular) fit: see ?isSingular
Linear mixed model fit by REML ['lmerMod']
Formula: score ~ quality + (1 | rater) + (1 | title)
Data: df
REML criterion at convergence: -44.4
Scaled residuals:
Min 1Q Median 3Q Max
-2.60015 -0.09695 -0.09695 0.16712 1.67924
Random effects:
Groups Name Variance Std.Dev.
title (Intercept) 0.000000 0.00000
rater (Intercept) 0.000000 0.00000
Residual 0.001416 0.03763
Number of obs: 15, groups: title, 5; rater, 3
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.61984 0.01536 40.351
qualitygood -0.01535 0.01983 -0.774
Correlation of Fixed Effects:
(Intr)
qualitygood -0.775
convergence code: 0
boundary (singular) fit: see ?isSingular
This is more or less the same as your first model; but with your data the random-effect variances are estimated at zero (hence the singular fit), so it's not easy to get good estimates of the random effects.
Upvotes: 3