Lodore66

Reputation: 1185

Which is the correct way to specify crossed effects in a mixed linear model in statsmodels?

I have a question about crossed effects linear mixed models in statsmodels. Specifically, I can see two ways of representing my data and I'm not sure which is appropriate. Any advice appreciated!

My data are as below. I wish to ascertain whether the objective quality of a book ('good' or 'bad') predicts the score the book is assigned. Titles are nested within the quality variable, but titles and raters are crossed. (This is fake data so I'm not worried about models converging.)

rater   title   quality score
john    book_1  good    0.600833333
frank   book_2  bad     0.683020833
emma    book_3  good    0.653645833
john    book_4  bad     0.6528125
frank   book_5  good    0.6040625
emma    book_1  good    0.600833333
john    book_2  bad     0.522
frank   book_3  good    0.600833333
emma    book_4  bad     0.619464286
john    book_5  good    0.600833333
frank   book_1  good    0.57125
emma    book_2  bad     0.6296875
john    book_3  good    0.607205882
frank   book_4  bad     0.61203125
emma    book_5  good    0.600833333

One way to analyse this data is to take quality as my independent variable, score as my dependent variable, rater as my grouping variable, and use variance components to capture the crossed effects on title. This gives:

import statsmodels.api as sm
import statsmodels.formula.api as smf

md = smf.mixedlm('score ~ quality', vc_formula = {"title":"0 + title"}, groups = data['rater'], data = data).fit().summary()

Model summary:

           Mixed Linear Model Regression Results
===========================================================
Model:              MixedLM   Dependent Variable:   score  
No. Observations:   15        Method:               REML   
No. Groups:         3         Scale:                0.0007 
Min. group size:    5         Log-Likelihood:       22.1997
Max. group size:    5         Converged:            Yes    
Mean group size:    5.0                                    
-----------------------------------------------------------
                Coef.  Std.Err.    z    P>|z| [0.025 0.975]
-----------------------------------------------------------
Intercept        0.620    0.001 841.098 0.000  0.618  0.621
quality[T.good] -0.015    0.013  -1.158 0.247 -0.041  0.011
title Var        0.001                                     
===========================================================

This intuitively seems to me to be the correct approach. It gives me a p-value and coefficient for my IV and accounts for the crossed effects.

However, I've been advised elsewhere that crossed effects like this should be specified by treating the dataset as one group and specifying variation entirely using variance components. Thus:

data['groups'] = 1

md = smf.mixedlm('score ~ 1', vc_formula = {"rater":"0 + rater", "title":"0 + title", "quality":"0 + quality"}, groups = data['groups'], data = data).fit().summary()

Yielding:

        Mixed Linear Model Regression Results
=====================================================
Model:            MixedLM Dependent Variable: score  
No. Observations: 15      Method:             REML   
No. Groups:       1       Scale:              0.0013 
Min. group size:  15      Log-Likelihood:     24.4023
Max. group size:  15      Converged:          No     
Mean group size:  15.0                               
-----------------------------------------------------
            Coef. Std.Err.   z    P>|z| [0.025 0.975]
-----------------------------------------------------
Intercept   0.612    0.017 35.011 0.000  0.577  0.646
quality Var 0.000                                    
rater Var   0.000    0.020                           
title Var   0.000                                    
=====================================================

This model offers me no p-value, different coefficients, and different model test statistics all around. Now, I'm either wrong in my use of both models, or I'm wrong in my use of one of them. Can anyone advise me which is the case? Thanks.

Upvotes: 2

Views: 2660

Answers (1)

StupidWolf

Reputation: 46978

You are comparing two totally different models, which have different interpretations.

If you are interested in how quality affects the score, while treating the other variables as random intercepts, the first model is correct. I quickly checked and found a post mentioning crossed effects, and if you run the model with all your random intercepts in the variance components, you get roughly the same result:

data['group'] = 1
md2 = smf.mixedlm('score ~ quality', 
                  vc_formula = {"title":"0 + title","rater":"0 + rater"}, 
                  groups = data['group'], data = data).fit().summary()

           Mixed Linear Model Regression Results
===========================================================
Model:              MixedLM   Dependent Variable:   score  
No. Observations:   15        Method:               REML   
No. Groups:         1         Scale:                0.0014 
Min. group size:    15        Log-Likelihood:       22.0951
Max. group size:    15        Converged:            No     
Mean group size:    15.0                                   
-----------------------------------------------------------
                Coef.  Std.Err.    z    P>|z| [0.025 0.975]
-----------------------------------------------------------
Intercept        0.620    0.016  38.313 0.000  0.588  0.652
quality[T.good] -0.015    0.021  -0.736 0.462 -0.056  0.026
rater Var        0.000    0.030                            
title Var        0.000                                     
===========================================================

It's just a matter of whether 'rater' appears in this table. The vc_formula is also there for specifying more complex models, e.g. random slopes, so if it is a pure random intercept, I think you can use a combination of groups and vc_formula.
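
To illustrate (a sketch, assuming only the fake dataset from the question, rebuilt here as a pandas DataFrame named `data`; the variable names `m1` and `m2` are mine), the two pure-random-intercept specifications — rater as the grouping variable with title as a variance component, versus a single group with both crossed factors as variance components — give essentially the same fixed-effect estimate for quality:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Rebuild the fake dataset from the question
data = pd.DataFrame({
    "rater": ["john", "frank", "emma"] * 5,
    "title": ["book_1", "book_2", "book_3", "book_4", "book_5"] * 3,
    "quality": ["good", "bad", "good", "bad", "good"] * 3,
    "score": [0.600833333, 0.683020833, 0.653645833, 0.6528125, 0.6040625,
              0.600833333, 0.522, 0.600833333, 0.619464286, 0.600833333,
              0.57125, 0.6296875, 0.607205882, 0.61203125, 0.600833333],
})

# Spec 1: rater as the grouping variable, title as a variance component
m1 = smf.mixedlm("score ~ quality",
                 vc_formula={"title": "0 + title"},
                 groups=data["rater"], data=data).fit()

# Spec 2: one group spanning the dataset, both crossed factors as VCs
data["group"] = 1
m2 = smf.mixedlm("score ~ quality",
                 vc_formula={"title": "0 + title", "rater": "0 + rater"},
                 groups=data["group"], data=data).fit()

# The fixed-effect estimates for quality should agree closely
print(m1.fe_params["quality[T.good]"], m2.fe_params["quality[T.good]"])
```

Both fits may warn about convergence (the random-effect variances sit at the boundary for this data), but the quality coefficient is about -0.015 either way.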

In the second model, you are modeling the score with only an intercept, independent of quality, which doesn't make sense if you are interested in the effect of quality.

Lastly, we can check against the results in R with the following:

df = structure(list(rater = structure(c(3L, 2L, 1L, 3L, 2L, 1L, 3L, 
2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L), .Label = c("emma", "frank", 
"john"), class = "factor"), title = structure(c(1L, 2L, 3L, 4L, 
5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L), .Label = c("book_1", 
"book_2", "book_3", "book_4", "book_5"), class = "factor"), quality = structure(c(2L, 
1L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L), .Label = c("bad", 
"good"), class = "factor"), score = c(0.600833333, 0.683020833, 
0.653645833, 0.6528125, 0.6040625, 0.600833333, 0.522, 0.600833333, 
0.619464286, 0.600833333, 0.57125, 0.6296875, 0.607205882, 0.61203125, 
0.600833333)), class = "data.frame", row.names = c(NA, -15L))

library(lme4)
summary(lmer(score ~ quality + (1|rater) + (1|title),data=df))
boundary (singular) fit: see ?isSingular
Linear mixed model fit by REML ['lmerMod']
Formula: score ~ quality + (1 | rater) + (1 | title)
   Data: df

REML criterion at convergence: -44.4

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.60015 -0.09695 -0.09695  0.16712  1.67924 

Random effects:
 Groups   Name        Variance Std.Dev.
 title    (Intercept) 0.000000 0.00000 
 rater    (Intercept) 0.000000 0.00000 
 Residual             0.001416 0.03763 
Number of obs: 15, groups:  title, 5; rater, 3

Fixed effects:
            Estimate Std. Error t value
(Intercept)  0.61984    0.01536  40.351
qualitygood -0.01535    0.01983  -0.774

Correlation of Fixed Effects:
            (Intr)
qualitygood -0.775
convergence code: 0
boundary (singular) fit: see ?isSingular

This is more or less the same as your first model, but with this data it's hard to estimate the random-effect variances: they are estimated at zero, which is why lme4 reports a singular fit.
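
You can see the same degeneracy on the statsmodels side (a sketch, again rebuilding the fake dataset; `res` is my name for the fitted result). The `vcomp` attribute of the fitted results holds the variance-component estimates, and values at or near zero are the analogue of lme4's singular-fit warning:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Same fake dataset as in the question
data = pd.DataFrame({
    "rater": ["john", "frank", "emma"] * 5,
    "title": ["book_1", "book_2", "book_3", "book_4", "book_5"] * 3,
    "quality": ["good", "bad", "good", "bad", "good"] * 3,
    "score": [0.600833333, 0.683020833, 0.653645833, 0.6528125, 0.6040625,
              0.600833333, 0.522, 0.600833333, 0.619464286, 0.600833333,
              0.57125, 0.6296875, 0.607205882, 0.61203125, 0.600833333],
})
data["group"] = 1

res = smf.mixedlm("score ~ quality",
                  vc_formula={"title": "0 + title", "rater": "0 + rater"},
                  groups=data["group"], data=data).fit()

# Variance components estimated at (near) zero signal the same
# boundary fit that lme4 flags with its isSingular warning
print(res.vcomp)   # one estimate per vc_formula entry
print(res.scale)   # residual variance, ~0.0014 as in the summaries
```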

Upvotes: 3
