pink99

Reputation: 51

emmeans function won't run or takes too long to run

I am new to R. I want to use the emmeans function to calculate estimated marginal means from a model fitted with lmer. The problem is that I have many (around 20) fixed-effect variables and one random-effect variable. lmer runs with no problem; I set the 20 or so categorical variables as factors before fitting. But when I call emmeans, I get this error:

Error: cannot allocate vector of size 49391.4 Gb

I know it is a memory issue. If I build the model with only 2-3 variables, emmeans will run, although it takes 20 minutes to finish. The dataset is fairly big (about 20k rows). Has anyone experienced the same thing? Should I use a different function, or is there any way to make this work in R? I am an SPSS user, and SPSS does not seem to take long for this calculation, so I do not understand why I cannot run it in R.
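For context on why the allocation is so enormous: emmeans first builds a reference grid with one row per combination of predictor levels, so its size is the product of the number of levels across all factors. A hypothetical illustration (the 5-levels-per-factor figure is made up, just to show the scaling):

```r
# Hypothetical: 20 categorical predictors with ~5 levels each.
# The reference grid emmeans would build has one row per combination,
# i.e. the product of the level counts.
n_levels <- rep(5L, 20)
prod(n_levels)   # 5^20, roughly 9.5e13 combinations
```

With a grid that large, even before the degrees-of-freedom computation starts, the memory required is far beyond any machine, which is consistent with the "cannot allocate vector" error.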

My R script looks like this:

mod1 <- lmer(overall ~ age + gender + job + a + b + ... + c + (1 | groupcode), data=dat, REML=T)
res1 <- emmeans::emmeans(mod1, specs = "age")
res2 <- emmeans::emmeans(mod1, specs = "gender")
...

Follow-up: I found some free data online so I could try to replicate the issue. I could not replicate it 100%, but it does show that the emmeans function takes too long; with a bigger dataset and more variables it won't run at all. Here is the code:


library(dplyr)
library(stringr)

rm(list = ls())

#data source
#http://www.bristol.ac.uk/cmm/learning/support/datasets/
#bottom of the page: Multilevel ordinal models for examination grades database (zip, 0.9 mb)
#unzip the file and save it under c:\momeg\
#I used file :a-level-geography.txt


#import data
dat <- read.csv("C:\\momeg\\a-level-geography.txt", header = FALSE,  sep = "")

#assign column names
colnames(dat) <- c("A-SCORE",   "BOARD", "GCSE-G-SCORE", "GENDER", "GTOT", "GNUM", "GCSE-MA-MAX", "GCSE-math-n", "AGE", 
                   "INST-GA-MN", "INST-GA-SD", "INSTTYPE", "LEA", "INSTITUTE", "STUDENT") %>% 
                   tolower(.) %>%
                   str_replace_all(., "-", "_")
#number of records
nrow(dat)

#centering score
dat$'a_score' <- dat$'a_score'- mean(dat$'a_score')



#set up categorical variables as factors
dat$gender <- factor(dat$gender)
dat$age <- factor(dat$age)
dat$gcse_g_score <- factor(dat$gcse_g_score)
dat$gcse_math_n <- factor(dat$gcse_math_n)
dat$insttype <- factor(dat$insttype)


library(lme4)
library(emmeans)

#run model

mod1 <- lmer(a_score ~ age + gender + gcse_g_score + gcse_math_n + insttype + (1 | institute), data=dat, REML=T)
summary(mod1)

#get emmean

emm_options(pbkrtest.limit = 50000) #increase the limit to avoid the note about the d.f. method being disabled

start.time <- Sys.time() #figure out how long it takes R to run the emmeans function
age.means <- emmeans::emmeans(mod1, specs = "age")
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

I have now run the emmeans function for over an hour and it is still running. Why does it take so long?

Upvotes: 5

Views: 992

Answers (1)

DaveArmstrong

Reputation: 21757

Not sure whether this does exactly the same thing, but it appears to be similar in the few cases I've tried. The big difference is the degrees of freedom used: ggpredict() doesn't apply the Kenward-Roger (or any other) correction to the d.f.

library(lme4)
fm2 <- lmer(Reaction ~ Days + (Days || Subject), sleepstudy)
emmeans::emmeans(fm2, specs="Days")
# Days emmean   SE df lower.CL upper.CL
# 4.5    299 8.88 25      280      317
# 
# Degrees-of-freedom method: kenward-roger 
# Confidence level used: 0.95 

library(ggeffects)
m <- mean(sleepstudy$Days)
ggpredict(fm2, terms="Days [m]")

# # Predicted values of Reaction
# # x = Days
# 
#    x | Predicted |   SE |           95% CI
# ------------------------------------------
# 4.50 |    298.51 | 8.88 | [281.11, 315.91]
# 
# Adjusted for:
# * Subject = 0 (population-level)

Upvotes: 2
