Reputation: 1204

Big data: generalized linear mixed-effects models

I'm looking for suggestions for a strategy of fitting generalized linear mixed-effects models for a relative large data-set.

Consider I have data on 8 milllion US basketball passes on about 300 teams in 10 years. The data looks something like this:

data <- data.frame(count = c(1,1,2,1,1,5),
               length_pass= c(1,2,5,7,1,3),
               year= c(1,1,1,2,2,2),
               mean_length_pass_team= c(15,15,9,14,14,8),
               team= c('A', 'A', 'B', 'A', 'A', 'B'))
data
 count length_pass year mean_length_pass_team team
1     1           1    1                    15    A
2     1           2    1                    15    A
3     2           5    1                     9    B
4     1           7    2                    14    A
5     1           1    2                    14    A
6     5           3    2                     8    B

I'm want to explain the count of steps a player takes before passing the ball. I have theoretical motivations to assume there are team-level differences between count and length_pass, so a multi-level (i.e. mixed effects) model seems appropriate.

My individual level control variables are length_pass and year.

On the team-level I have mean_length_pass_team. This should help me to avoid ecological fallacies, according to Snijders, 2011.

I have been using the lme4 and brms packages to estimate these models but it takes days/weeks to fit these models on my local 12-core 128GB machine.

library(lme4)
model_a <- glmer(count ~ length_pass + year + mean_length_pass_team + (1 | team),
                 data=data,
                 family= "poisson",
                 control=glmerControl(optCtrl=list(maxfun=2e8))) 

library(brms)
options (mc.cores=parallel::detectCores ())
model_b <- brm(count ~ length_pass + year + mean_length_pass_team + (1 | team),
                 data=data,
                 family= "poisson")

I am looking for suggestions to speed up the fitting process or new techniques to fit a generalized linear mixed-effects model:

(How) Can I improve the speed on the lme4 and brms fits?
Are there other packages to consider?
Are there step-wise procedures that can help increase the speed of fitting models?
Are there interesting options outside the R environment that can help me fit this?

Any pointers are much appreciated!

Upvotes: 6

Answers (3)

swihart

Reputation: 2738

It's not perfect everytime, but mgcv::bam() can fit this type of model quicker than glmer(..., nAGQ=0). Try the code below: 9 million rows and 300 groups (teams) for simulated data.

mgcv:bam() is fitted in 30 seconds compared to
glmer(..., nAGQ=0) taking ~10x as long at 5 minutes
Credit to Michael Clark's blog post and their package mixedup on Github for this approach.

Results from simulation:

extract_fixed_effects(model_a)
## # A tibble: 3 × 7
##   term       value    se         z p_value lower_2.5 upper_97.5
##   <chr>      <dbl> <dbl>     <dbl>   <dbl>     <dbl>      <dbl>
## 1 Intercept -0.363 0.392    -0.925   0.355    -1.13       0.406
## 2 x          0.5   0     76034.      0         0.5        0.5  
## 3 b          1.09  0.574     1.90    0.058    -0.037      2.21
#extract_fixed_effects(model_b)
extract_fixed_effects(model_c)
## # A tibble: 3 × 7
##   term       value    se         t p_value lower_2.5 upper_97.5
##   <chr>      <dbl> <dbl>     <dbl>   <dbl>     <dbl>      <dbl>
## 1 Intercept -0.364 0.389    -0.934   0.35     -1.13       0.399
## 2 x          0.5   0     76033.      0         0.5        0.5  
## 3 b          1.08  0.569     1.89    0.058    -0.038      2.19
 extract_vc(model_a, digits = 5)  
## Computing profile confidence intervals ...
## (4!=1)
## Warning in extract_vc.merMod(model_a, digits = 5): Intervals could not be
## computed
## # A tibble: 1 × 5
##   group effect    variance    sd var_prop
##   <chr> <chr>        <dbl> <dbl>    <dbl>
## 1 g     Intercept     24.5  4.95        1
 #extract_vc(model_b, digits = 5)  
 extract_vc(model_c, digits = 5)  
## # A tibble: 1 × 7
##   group effect    variance    sd sd_2.5 sd_97.5 var_prop
##   <chr> <chr>        <dbl> <dbl>  <dbl>   <dbl>    <dbl>
## 1 g     Intercept     24.1  4.91   4.53    5.33        1

Simulation code:

library(mgcv)
library(lme4)
library(brms)
## https://github.com/m-clark/mixedup
## remotes::install_github("m-clark/mixedup")
library(mixedup)
library(tidyverse)
# data <- data.frame(count = c(1,1,2,1,1,5),
#                length_pass= c(1,2,5,7,1,3),
#                year= c(1,1,1,2,2,2),
#                mean_length_pass_team= c(15,15,9,14,14,8),
#                team= c('A', 'A', 'B', 'A', 'A', 'B'))
# data


set.seed(12358)

N = 9e6                                 # total sample size
n_groups = 300                          # number of groups
g = rep(1:n_groups, e = N/n_groups)      # the group identifier

x = rnorm(N)                  
b = rbinom(n_groups, size = 1, prob=.5)  # a cluster level categorical variable
b = b[g]

sd_g =  5     # standard deviation for the random effect
sigma = 1     # standard deviation for the observation

re0 = rnorm(n_groups, sd = sd_g)  # random effects
re  = re0[g]

lp = 0 + 0.5*x + .25*b + re        # linear predictor 

y = rnorm(N, mean = lp, sd = sigma)               # create a continuous target variable
y_bin = rbinom(N, size = 1, prob = plogis(lp))    # create a binary target variable
y_count = rpois(N, lambda=exp(lp))

d = tibble(x, b, y, y_bin, y_count, g = factor(g))
#table(d$y_count)






start <- Sys.time()
model_a <- glmer( y_count ~ x + b + (1|g),
                 data=d,
                 family= "poisson",
                 control=glmerControl(optCtrl=list(maxfun=2e8)),
                 nAGQ = 0) 
end <- Sys.time()
time_a <- end-start
time_a

start <- Sys.time()
model_c <- mgcv::bam( y_count ~ x + b + s(g, bs='re'),
                      data=d,
                      family= "poisson",
                      discrete=TRUE,
                      gc.level = 0,
                      use.chol = TRUE,
                      nthreads = 12/16,
                      chunk.size = 1000,
                      samfrac = .1
) 
end <- Sys.time()
time_c <- end-start
time_c

extract_fixed_effects(model_a)
extract_fixed_effects(model_c)

extract_vc(model_a, digits = 5)  
extract_vc(model_c, digits = 5)

Update: I ran MCMCglmm on this simulated dataset and got the RStudio abort/crash window. I ran brm on this dataset and it took 3.5 hours and had several warnings and we did not recover the parameters as well as glmer and bam:

# > time_e
# Time difference of 3.556064 hours
#  
#  > extract_fixed_effects(model_e)
# # A tibble: 3 × 5
#   term      value    se lower_2.5 upper_97.5
#   <chr>     <dbl> <dbl>     <dbl>      <dbl>
# 1 Intercept 4.38  1.18      2.90        5.88
# 2 x         0.811 0.609     0.484       2.00
# 3 b         1.04  0.923    -0.271       2.66

# >  extract_vc(model_e, digits = 50)  
# # A tibble: 1 × 7
#   group effect    variance    sd sd_2.5 sd_97.5 var_prop
#   <chr> <chr>        <dbl> <dbl>  <dbl>   <dbl>    <dbl>
# 1 g     Intercept    0.637 0.798  0.106    1.45        1

Simulation code, continued (caution -- model_d (MCMCglmm) crashes and model_e (brm) takes 3.5 hours):

library(MCMCglmm)
start <- Sys.time()
model_d <- MCMCglmm(data = d, family = "poisson",
             fixed = y_count ~ x+b, 
             random = ~ g)
end <- Sys.time()
time_d <- end-start
time_d


library(brms)
start <- Sys.time()
model_e  <- brm(y_count ~ x + b + (1|g),
            data = d, family=poisson,
            chains=4, cores=4)

end <- Sys.time()
time_e <- end-start
time_e

extract_fixed_effects(model_d)

extract_fixed_effects(model_e)

 extract_vc(model_d, digits = 5)  

 extract_vc(model_e, digits = 50)

Upvotes: 2

Lacey Etzkorn

Reputation: 75

I have found the package MCMCglmm to be much faster than brms for models that MCMCglmm can fit (I've sometimes found brms fits models I can't fit with MCMCglmm).

You may need to toy around with the syntax, but it would be something like this:

    MCMCglmm(data = data, family = "poisson",
             fixed = count ~ year, 
             random = ~ team)

A downside is that I have found it difficult in the past to find many online code examples that are connected to an explicit mathematical formulation of the models--it can be difficult to judge whether you are fitting the model you intent to fit. However, your model seems simple enough.

Upvotes: 1

Cameron Kay

Reputation: 73

For general speed improvements, I would suggest using openBLAS instead of the native BLAS. Unfortunately, I don't believe LME4 relies on BLAS.

However, I can also suggest generating the LME4 models in parallel, which would effectively cut your wait time in half.

Upvotes: 0

Big data: generalized linear mixed-effects models

Answers (3)

Related Questions