Reputation: 1204
I'm looking for suggestions for a strategy of fitting generalized linear mixed-effects models for a relative large data-set.
Consider I have data on 8 milllion US basketball passes on about 300 teams in 10 years. The data looks something like this:
data <- data.frame(count = c(1,1,2,1,1,5),
length_pass= c(1,2,5,7,1,3),
year= c(1,1,1,2,2,2),
mean_length_pass_team= c(15,15,9,14,14,8),
team= c('A', 'A', 'B', 'A', 'A', 'B'))
data
count length_pass year mean_length_pass_team team
1 1 1 1 15 A
2 1 2 1 15 A
3 2 5 1 9 B
4 1 7 2 14 A
5 1 1 2 14 A
6 5 3 2 8 B
I'm want to explain the count
of steps a player takes before passing the ball. I have theoretical motivations to assume there are team-level differences between count
and length_pass
, so a multi-level (i.e. mixed effects) model seems appropriate.
My individual level control variables are length_pass
and year
.
On the team-level I have mean_length_pass_team
. This should help me to avoid ecological fallacies, according to Snijders, 2011.
I have been using the lme4
and brms
packages to estimate these models but it takes days/weeks to fit these models on my local 12-core 128GB machine.
library(lme4)
model_a <- glmer(count ~ length_pass + year + mean_length_pass_team + (1 | team),
data=data,
family= "poisson",
control=glmerControl(optCtrl=list(maxfun=2e8)))
library(brms)
options (mc.cores=parallel::detectCores ())
model_b <- brm(count ~ length_pass + year + mean_length_pass_team + (1 | team),
data=data,
family= "poisson")
I am looking for suggestions to speed up the fitting process or new techniques to fit a generalized linear mixed-effects model:
lme4
and brms
fits?Any pointers are much appreciated!
Upvotes: 6
Views: 2438
Reputation: 2738
It's not perfect everytime, but mgcv::bam()
can fit this type of model quicker than glmer(..., nAGQ=0)
. Try the code below: 9 million rows and 300 groups (teams) for simulated data.
mgcv:bam()
is fitted in 30 seconds compared toglmer(..., nAGQ=0)
taking ~10x as long at 5 minutesmixedup
on Github for this approach.Results from simulation:
extract_fixed_effects(model_a)
## # A tibble: 3 × 7
## term value se z p_value lower_2.5 upper_97.5
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Intercept -0.363 0.392 -0.925 0.355 -1.13 0.406
## 2 x 0.5 0 76034. 0 0.5 0.5
## 3 b 1.09 0.574 1.90 0.058 -0.037 2.21
#extract_fixed_effects(model_b)
extract_fixed_effects(model_c)
## # A tibble: 3 × 7
## term value se t p_value lower_2.5 upper_97.5
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Intercept -0.364 0.389 -0.934 0.35 -1.13 0.399
## 2 x 0.5 0 76033. 0 0.5 0.5
## 3 b 1.08 0.569 1.89 0.058 -0.038 2.19
extract_vc(model_a, digits = 5)
## Computing profile confidence intervals ...
## (4!=1)
## Warning in extract_vc.merMod(model_a, digits = 5): Intervals could not be
## computed
## # A tibble: 1 × 5
## group effect variance sd var_prop
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 g Intercept 24.5 4.95 1
#extract_vc(model_b, digits = 5)
extract_vc(model_c, digits = 5)
## # A tibble: 1 × 7
## group effect variance sd sd_2.5 sd_97.5 var_prop
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 g Intercept 24.1 4.91 4.53 5.33 1
Simulation code:
library(mgcv)
library(lme4)
library(brms)
## https://github.com/m-clark/mixedup
## remotes::install_github("m-clark/mixedup")
library(mixedup)
library(tidyverse)
# data <- data.frame(count = c(1,1,2,1,1,5),
# length_pass= c(1,2,5,7,1,3),
# year= c(1,1,1,2,2,2),
# mean_length_pass_team= c(15,15,9,14,14,8),
# team= c('A', 'A', 'B', 'A', 'A', 'B'))
# data
set.seed(12358)
N = 9e6 # total sample size
n_groups = 300 # number of groups
g = rep(1:n_groups, e = N/n_groups) # the group identifier
x = rnorm(N)
b = rbinom(n_groups, size = 1, prob=.5) # a cluster level categorical variable
b = b[g]
sd_g = 5 # standard deviation for the random effect
sigma = 1 # standard deviation for the observation
re0 = rnorm(n_groups, sd = sd_g) # random effects
re = re0[g]
lp = 0 + 0.5*x + .25*b + re # linear predictor
y = rnorm(N, mean = lp, sd = sigma) # create a continuous target variable
y_bin = rbinom(N, size = 1, prob = plogis(lp)) # create a binary target variable
y_count = rpois(N, lambda=exp(lp))
d = tibble(x, b, y, y_bin, y_count, g = factor(g))
#table(d$y_count)
start <- Sys.time()
model_a <- glmer( y_count ~ x + b + (1|g),
data=d,
family= "poisson",
control=glmerControl(optCtrl=list(maxfun=2e8)),
nAGQ = 0)
end <- Sys.time()
time_a <- end-start
time_a
start <- Sys.time()
model_c <- mgcv::bam( y_count ~ x + b + s(g, bs='re'),
data=d,
family= "poisson",
discrete=TRUE,
gc.level = 0,
use.chol = TRUE,
nthreads = 12/16,
chunk.size = 1000,
samfrac = .1
)
end <- Sys.time()
time_c <- end-start
time_c
extract_fixed_effects(model_a)
extract_fixed_effects(model_c)
extract_vc(model_a, digits = 5)
extract_vc(model_c, digits = 5)
Update: I ran MCMCglmm
on this simulated dataset and got the RStudio abort/crash window. I ran brm
on this dataset and it took 3.5 hours and had several warnings and we did not recover the parameters as well as glmer
and bam
:
# > time_e
# Time difference of 3.556064 hours
#
# > extract_fixed_effects(model_e)
# # A tibble: 3 × 5
# term value se lower_2.5 upper_97.5
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 Intercept 4.38 1.18 2.90 5.88
# 2 x 0.811 0.609 0.484 2.00
# 3 b 1.04 0.923 -0.271 2.66
# > extract_vc(model_e, digits = 50)
# # A tibble: 1 × 7
# group effect variance sd sd_2.5 sd_97.5 var_prop
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 g Intercept 0.637 0.798 0.106 1.45 1
Simulation code, continued (caution -- model_d
(MCMCglmm) crashes and model_e
(brm) takes 3.5 hours):
library(MCMCglmm)
start <- Sys.time()
model_d <- MCMCglmm(data = d, family = "poisson",
fixed = y_count ~ x+b,
random = ~ g)
end <- Sys.time()
time_d <- end-start
time_d
library(brms)
start <- Sys.time()
model_e <- brm(y_count ~ x + b + (1|g),
data = d, family=poisson,
chains=4, cores=4)
end <- Sys.time()
time_e <- end-start
time_e
extract_fixed_effects(model_d)
extract_fixed_effects(model_e)
extract_vc(model_d, digits = 5)
extract_vc(model_e, digits = 50)
Upvotes: 2
Reputation: 75
I have found the package MCMCglmm
to be much faster than brms
for models that MCMCglmm
can fit (I've sometimes found brms
fits models I can't fit with MCMCglmm
).
You may need to toy around with the syntax, but it would be something like this:
MCMCglmm(data = data, family = "poisson",
fixed = count ~ year,
random = ~ team)
A downside is that I have found it difficult in the past to find many online code examples that are connected to an explicit mathematical formulation of the models--it can be difficult to judge whether you are fitting the model you intent to fit. However, your model seems simple enough.
Upvotes: 1
Reputation: 73
For general speed improvements, I would suggest using openBLAS instead of the native BLAS. Unfortunately, I don't believe LME4 relies on BLAS.
However, I can also suggest generating the LME4 models in parallel, which would effectively cut your wait time in half.
Upvotes: 0