Reputation: 51
I am new to R. I want to use the emmeans function to calculate estimated marginal means from a model fitted with the lmer function. The problem is that I have lots (around 20) of fixed-effect variables and one random-effect variable. I can run lmer with no problem (I set the 20-odd categorical variables as factors beforehand), but when I use emmeans I get this error:
Error: cannot allocate vector of size 49391.4 Gb
I know it is a memory issue. If I build the model with only 2-3 variables, emmeans runs, although it takes 20 minutes to finish. The dataset is quite big (about 20k rows). Has anyone experienced the same thing? Should I use a different function? Is there any way to make this work in R? I am an SPSS user, and SPSS does not seem to take long for this calculation, so I do not understand why I cannot run it in R.
My R script looks like this:
mod1 <- lmer(overall ~ age + gender + job + a + b + ... + c + (1 | groupcode), data=dat, REML=T)
res1 <- emmeans::emmeans(mod1, specs = "age")
res2 <- emmeans::emmeans(mod1, specs = "gender")
...
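As far as I can tell, emmeans first builds a reference grid that crosses every level of every factor in the model, so with 20-odd categorical predictors the grid grows as the product of all the level counts, which would explain an allocation error of that size. A minimal sketch to estimate the grid size before calling emmeans (the factor names here are placeholders for the real ones):
#rough size of the reference grid emmeans will build:
#the product of the level counts of all factors in the model
fixed_factors <- c("age", "gender", "job")  #placeholder: list all ~20 factor names here
prod(sapply(dat[fixed_factors], nlevels))   #rows emmeans must construct and average over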
Follow-up: I have found some free data online, so I can try to replicate the issue. I could not replicate it 100%, but it does show the problem that the emmeans function takes too long; with a bigger dataset and more variables it will not run at all. Here is the code:
library(dplyr)
library(stringr)
rm(list = ls())
#data source
#http://www.bristol.ac.uk/cmm/learning/support/datasets/
#bottom of the page: Multilevel ordinal models for examination grades database (zip, 0.9 mb)
#unzip the file and save it under c:\momeg\
#I used the file a-level-geography.txt
#import data
dat <- read.csv("C:\\momeg\\a-level-geography.txt", header = FALSE, sep = "")
#assign column names
nms <- c("A-SCORE", "BOARD", "GCSE-G-SCORE", "GENDER", "GTOT", "GNUM", "GCSE-MA-MAX", "GCSE-math-n", "AGE",
         "INST-GA-MN", "INST-GA-SD", "INSTTYPE", "LEA", "INSTITUTE", "STUDENT")
#lower-case the names and replace "-" with "_" so they are valid R names
colnames(dat) <- str_replace_all(tolower(nms), "-", "_")
#number of records
nrow(dat)
#center the score
dat$a_score <- dat$a_score - mean(dat$a_score)
#set up categorical variables as factors
dat$gender <- factor(dat$gender)
dat$age <- factor(dat$age)
dat$gcse_g_score <- factor(dat$gcse_g_score)
dat$gcse_math_n <- factor(dat$gcse_math_n)
dat$insttype <- factor(dat$insttype)
library(lme4)
library(emmeans)
#run model
mod1 <- lmer(a_score ~ age + gender + gcse_g_score + gcse_math_n + insttype + (1 | institute), data=dat, REML=T)
summary(mod1)
#get emmean
emm_options(pbkrtest.limit = 50000) #raise the limit so the Kenward-Roger d.f. method is not disabled for this data size
start.time <- Sys.time() #time how long R takes to run the emmeans function
age.means <- emmeans::emmeans(mod1, specs = "age")
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
I have now been running the emmeans function for over an hour and it is still going. Why does it take so long?
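One thing worth testing, if I understand the emmeans documentation correctly, is whether the Kenward-Roger degrees-of-freedom computation is the bottleneck: for lmer models, emmeans accepts an lmer.df argument that switches to a cheaper method. A sketch against the same model (not something I have verified on this data):
#same call, but with the expensive Kenward-Roger d.f. computation skipped;
#"satterthwaite" is cheaper and "asymptotic" skips d.f. entirely (z intervals)
age.means.fast <- emmeans::emmeans(mod1, specs = "age", lmer.df = "asymptotic")
age.means.fast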
Upvotes: 5
Views: 992
Reputation: 21757
Not sure whether this does exactly the same thing, but it appears to be similar in the few cases I've tried. The big difference is the degrees of freedom used: ggpredict() doesn't apply the Kenward-Roger (or any other) correction to the df.
library(lme4)
fm2 <- lmer(Reaction ~ Days + (Days || Subject), sleepstudy)
emmeans::emmeans(fm2, specs="Days")
# Days emmean SE df lower.CL upper.CL
# 4.5 299 8.88 25 280 317
#
# Degrees-of-freedom method: kenward-roger
# Confidence level used: 0.95
library(ggeffects)
m <- mean(sleepstudy$Days)
ggpredict(fm2, terms="Days [m]")
# # Predicted values of Reaction
# # x = Days
#
# x | Predicted | SE | 95% CI
# ------------------------------------------
# 4.50 | 298.51 | 8.88 | [281.11, 315.91]
#
# Adjusted for:
# * Subject = 0 (population-level)
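For comparison, emmeans itself can be told to skip the Kenward-Roger computation via its lmer.df argument; with "asymptotic" it uses z-based intervals, which should essentially reproduce the ggpredict() interval above (a sketch, output not shown):
#asymptotic (z-based) intervals: no Kenward-Roger d.f. computation,
#so df is reported as Inf and the CI should match the z interval above
emmeans::emmeans(fm2, specs = "Days", lmer.df = "asymptotic")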
Upvotes: 2