Reputation: 1363
The dataset dt consists of 166 thousand rows, with the following columns:
log_price: numerical dependent variable (the response)
sku: independent categorical regressor with 381 levels
year: independent categorical regressor with 15 levels
transaction_type: independent categorical regressor with 2 levels
purchaser: independent categorical regressor with 1001 levels
regressor_01, ..., regressor_04: four independent numerical regressors
We only consider skus that appear in more than 200 rows of the dataset.
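For reference, a minimal sketch of that filtering step, assuming dt is a data.table (as suggested by the session info below) and sku is stored as a factor:
library(data.table)
# Count the rows per sku, then keep only skus seen in more than 200 rows
sku_counts <- dt[, .N, by = sku]
dt <- dt[sku %in% sku_counts[N > 200, sku]]
# sku is assumed to be a factor; drop the levels removed by the filter
dt[, sku := droplevels(sku)]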
The model_1 regression took a couple of minutes and gave nice results:
model_1 <- lmer(formula = log_price ~
                  0 +
                  transaction_type +
                  regressor_01 +
                  regressor_02 +
                  regressor_03 +
                  regressor_04 +
                  year +
                  (1 | sku) +
                  (1 + year | purchaser),
                data = dt)
model_2 is similar to model_1; the difference is that I treat sku as a fixed effect instead of a random effect:
model_2 <- lmer(formula = log_price ~
                  0 +
                  transaction_type +
                  regressor_01 +
                  regressor_02 +
                  regressor_03 +
                  regressor_04 +
                  year +
                  sku +
                  (1 + year | purchaser),
                data = dt)
However, model_2 (a) took more than 48 hours to run, (b) ended abruptly without any warnings or errors, and (c) kept optimizing at the ninth significant digit (see output below).
On another occasion I tried to speed it up with:
control = lmerControl(optimizer = "optimx", calc.derivs = FALSE, optCtrl = list(method = "nlminb", starttests = FALSE, kkt = FALSE))
I gave up on that tuning because the run was still ending abruptly, so I removed the control argument to make sure the problem was not caused by it.
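For completeness, here is roughly how that control argument was combined with the model_2 call (just a sketch of the attempt described above; the optimx package needs to be loaded for optimizer = "optimx"):
library(optimx)  # required when optimizer = "optimx"
model_2 <- lmer(formula = log_price ~
                  0 +
                  transaction_type +
                  regressor_01 +
                  regressor_02 +
                  regressor_03 +
                  regressor_04 +
                  year +
                  sku +
                  (1 + year | purchaser),
                data = dt,
                control = lmerControl(optimizer = "optimx",
                                      calc.derivs = FALSE,
                                      optCtrl = list(method = "nlminb",
                                                     starttests = FALSE,
                                                     kkt = FALSE)))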
I can't understand why the model converges so nicely when sku is a random effect but does not converge when sku becomes a fixed effect. What might I be doing wrong? Any tips?
Last lines of output:
iteration: 19880
f(x) = 272182.459680
iteration: 19881
f(x) = 272182.459677
iteration: 19882
f(x) = 272182.459672
iteration: 19883
f(x) = 272182.459669
iteration: 19884
f(x) = 272182.459669
iteration: 19885
f(x) = 272182.459672
iteration: 19886
f(x) = 272182.459665
iteration: 19887
f(x) = 272182.459665
Session Info:
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252 LC_MONETARY=Portuguese_Brazil.1252
[4] LC_NUMERIC=C LC_TIME=Portuguese_Brazil.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stargazer_5.2.2 ivreg_0.5-0 ggplot2_3.3.2 lme4_1.1-25 Matrix_1.2-18 plm_2.2-5
[7] future.apply_1.6.0 future_1.20.1 magrittr_2.0.1 data.table_1.13.2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 bdsmatrix_1.3-4 lattice_0.20-41 listenv_0.8.0 zoo_1.8-8 digest_0.6.27
[7] lmtest_0.9-38 parallelly_1.21.0 R6_2.5.0 cellranger_1.1.0 pillar_1.4.7 Rdpack_2.1
[13] miscTools_0.6-26 rlang_0.4.8 curl_4.3 readxl_1.3.1 rstudioapi_0.13 minqa_1.2.4
[19] car_3.0-10 nloptr_1.2.2.2 splines_4.0.2 statmod_1.4.35 foreign_0.8-80 munsell_0.5.0
[25] tinytex_0.27 numDeriv_2016.8-1.1 compiler_4.0.2 xfun_0.19 pkgconfig_2.0.3 globals_0.13.1
[31] maxLik_1.4-4 tidyselect_1.1.0 tibble_3.0.4 rio_0.5.16 codetools_0.2-16 crayon_1.3.4
[37] dplyr_1.0.2 withr_2.3.0 MASS_7.3-51.6 rbibutils_2.0 grid_4.0.2 nlme_3.1-148
[43] gtable_0.3.0 lifecycle_0.2.0 scales_1.1.1 zip_2.1.1 stringi_1.5.3 carData_3.0-4
[49] ellipsis_0.3.1 generics_0.1.0 vctrs_0.3.5 optimx_2020-4.2 boot_1.3-25 sandwich_3.0-0
[55] openxlsx_4.2.3 Formula_1.2-4 tools_4.0.2 forcats_0.5.0 glue_1.4.2 purrr_0.3.4
[61] hms_0.5.3 abind_1.4-5 parallel_4.0.2 yaml_2.2.1 colorspace_2.0-0 gbRd_0.4-11
[67] haven_2.3.1
Upvotes: 2
Views: 1188
Reputation: 226162
The main issue is that making sku fixed will explode the size of the fixed-effect model matrix (X, if you're reading along in vignette("lmer")). The random-effects model matrix (Z) is coded as a sparse indicator matrix; the fixed-effects model matrix is dense. Adding sku will increase the fixed-effects model matrix by 381 columns, or (8*381*166e3)/2^20 = 483 Mb. Provided you have the memory available, that's not necessarily going to kill you, but it's not surprising that the required matrix operations are a lot slower on a huge dense matrix than on its sparse equivalent.
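To make the dense-vs-sparse contrast concrete, here is a small illustrative sketch (the 381 levels and 166,000 rows mirror your setup; sparse.model.matrix() is from the Matrix package, which lme4 already attaches):
library(Matrix)
(8 * 381 * 166e3) / 2^20                         # ~483 Mb, the back-of-the-envelope estimate above
set.seed(1)
f <- factor(sample(381, 166e3, replace = TRUE))  # toy stand-in for sku
X_dense  <- model.matrix(~ 0 + f)                # dense indicator matrix
X_sparse <- sparse.model.matrix(~ 0 + f)         # sparse indicator matrix
print(object.size(X_dense), units = "Mb")
print(object.size(X_sparse), units = "Mb")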
It's not clear whether "ending abruptly" means that the R command quits (with an error?) and returns you to the prompt, or whether the entire R session stops. Having your entire R session quit unexpectedly is often a symptom of running out of memory (at least on Unix operating systems, the operating system will kill a process rather than allowing it to freeze the entire OS by grabbing more memory).
What can you do about this? It would be nice to be able to specify that the fixed-effect design matrix should be sparse ...
- glmmTMB allows sparse fixed-effect model matrices (see the sketch after this list).
- You could keep sku sparse (i.e. leave it as a random effect) but force the among-sku variance to be very large, which effectively makes it back into a fixed effect.
- You might also try control = lmerControl(calc.derivs = FALSE); the slowness you're seeing at the end is the brute-force Hessian calculation.
Separately: since year is a categorical predictor, the covariance matrix associated with (1 + year | purchaser) is 15x15, with 15*16/2 = 120 parameters. Your computation will speed up a lot more if you can cut this down. For example, you could make year numeric, fit a pretty complex spline model, and still save a lot of parameters: (1 + splines::ns(year, df = 4) | purchaser) would only require 5*6/2 = 15 parameters. Further adding (1 | purchaser:year) will give you uncorrelated variation among years within purchasers (around the spline curve) and will only cost you one more parameter; you'll still have reduced the number of top-level parameters (the most important dimension of the problem) by a factor of 7.5.
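Put together, that last suggestion might look something like this sketch (year_num is a hypothetical numeric copy of your year factor, df = 4 is just an example, and sku is kept as a fixed effect as in model_2):
library(splines)
dt[, year_num := as.numeric(as.character(year))]            # numeric copy of the year factor
model_3 <- lmer(log_price ~
                  0 +
                  transaction_type +
                  regressor_01 +
                  regressor_02 +
                  regressor_03 +
                  regressor_04 +
                  year +
                  sku +
                  (1 + ns(year_num, df = 4) | purchaser) +  # 5x5 covariance: 5*6/2 = 15 parameters
                  (1 | purchaser:year),                     # uncorrelated year-within-purchaser variation
                data = dt,
                control = lmerControl(calc.derivs = FALSE))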
Upvotes: 4