Petr

Reputation: 1817

broom::augment too slow with panel mixed model

When I fit a mixed model on panel data and then retrieve the complete regression output using broom::augment, it is very slow compared to other specifications such as the random- or fixed-effects models.

example:

#Load packages
library(foreign)
library(plm)
library(stargazer)
library(dplyr)
library(tidyverse)
library(quantreg)
library(lme4)   # needed for lmer() below

#Load in Wooldridge data on crime
crime <- read.dta("http://fmwww.bc.edu/ec-p/data/wooldridge/crime4.dta")

#Declare our data to be a panel data set
crime.p <- pdata.frame(crime,index=c("county","year"))
crime.p %>% head

panel1 <- plm(crmrte ~ polpc + prbconv + avgsen + density, data = crime.p, model = "within")
panel2 <- plm(crmrte ~ polpc + prbconv + avgsen + density, data = crime.p, model = "random")
panel3 <- plm(crmrte ~ polpc + prbconv + avgsen + density, data = crime.p, model = "pool")
panel4 <- rq(crmrte ~ polpc + prbconv + avgsen + density + factor(county)-1, data = crime.p)
panel5 <- lmer(crmrte ~ polpc + prbconv + avgsen + density + (1 | county), data = crime.p)

broom::augment(panel1)
broom::augment(panel2)
broom::augment(panel4)

# This one is very slow
broom::augment(panel5)

Here it is merely slow, but with bigger datasets it can take up to 20 minutes. Is there a way to retrieve the output that broom::augment gives for the mixed model, but much faster?

Upvotes: 2

Views: 353

Answers (1)

Chris

Reputation: 3986

panel5 isn't particularly slow on my machine (~0.01 seconds). You can see where most of the time is spent by calculating each column separately and timing each piece.
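For instance, a minimal sketch (assuming the panel5 object fitted in the question) that benchmarks the candidate pieces individually:

microbenchmark::microbenchmark(
  pred = predict(panel5),
  res  = resid(panel5),
  hat  = hatvalues(panel5),
  cook = cooks.distance(panel5)
)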

I created a custom function to calculate each of the elements of the augment result:

fe <- function(panel){
  f <- fixef(panel)
  int <- f[1]
  coef <- f[-1]
  d <- panel@frame                     # use the argument, not the global panel5
  d1 <- d[names(d) %in% names(coef)]   # keep only the fixed-effect predictor columns

  # fixed-effects-only linear predictor: intercept + sum of coef * predictor
  int + rowSums(mapply(`*`, d1, coef))
}

augment2 <- function(panel) {
  d <- panel@frame
  p <- predict(panel)
  r <- resid(panel)
  hat <- hatvalues(panel)
  cook <- cooks.distance(panel)
  fe <- fe(panel)
  mu <- panel@resp$mu      # fitted mean from the response module
  wr <- panel@resp$wtres   # weighted residuals
  cbind(d, fitted = p, residual = r, hat, cooksd = cook, fixed = fe, mu, wtres = wr)
}
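As an aside, lme4 can produce the fixed-effects-only predictions directly, so the manual fe() helper can be cross-checked against predict() with re.form = NA (a sketch; for this identity-link model the two should agree):

# fixed-effects-only predictions straight from lme4
fe_pred <- predict(panel5, re.form = NA)
all.equal(unname(fe_pred), unname(fe(panel5)))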


If we time the two, they are pretty similar since all I've done is calculate the same things that augment calculates:

microbenchmark::microbenchmark(
  augment = broom::augment(panel5),
  cust = augment2(panel5)
)
#Unit: milliseconds
#    expr       min       lq     mean   median       uq      max neval
# augment 12.332313 12.85186 16.83977 13.31599 15.69942 59.84045   100
#    cust  9.976812 10.38016 11.94285 10.66314 11.32606 43.50708   100

If we take out each of the columns in turn from our custom augment2 function, we can see that most of the time is taken in calculating hatvalues and cooks.distance. If you don't really need one or both of these columns, it will speed your calculation up significantly. Excluding both, we can see a 25x increase in speed:

augment3 <- function(panel) {
  d <- panel@frame
  p <- predict(panel)
  r <- resid(panel)
  #hat <- hatvalues(panel)
  #cook <- cooks.distance(panel)
  fe <- fe(panel)
  mu <- panel@resp$mu
  wr <- panel@resp$wtres
  cbind(d, fitted = p, residual = r, fixed = fe, mu, wtres = wr)
}

microbenchmark::microbenchmark(
  augment = broom::augment(panel5),
  cust = augment3(panel5)
)

#Unit: microseconds
#    expr       min        lq      mean    median         uq       max neval
# augment 12549.486 12924.666 17760.754 13341.712 15480.9555 87276.359   100
#    cust   406.818   474.119   694.698   532.005   586.5685  4702.398   100

Basically the time taken is due to cooks.distance and hatvalues, which I can't imagine could be calculated much more efficiently than they already are.
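If you need the influence measures only occasionally, one option is to make them opt-in. A sketch (augment4 is a hypothetical name, building on the augment3() defined above):

augment4 <- function(panel, influence = FALSE) {
  out <- augment3(panel)
  if (influence) {   # pay for hatvalues/cooks.distance only on request
    out$hat    <- hatvalues(panel)
    out$cooksd <- cooks.distance(panel)
  }
  out
}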

Upvotes: 3
