Reputation: 1817
I would like to ask: when fitting mixed panel data models and then retrieving the complete regression output using broom::augment, it is very slow, as opposed to other models such as random or fixed effects.
example:
#Load packages
library(foreign)   # read.dta() for the Stata data file
library(plm)       # panel data models
library(stargazer)
library(dplyr)
library(tidyverse)
library(quantreg)  # rq() for quantile regression
library(lme4)      # lmer() for the mixed model (panel5 below)
#Load in Wooldridge data on crime
crime <- read.dta("http://fmwww.bc.edu/ec-p/data/wooldridge/crime4.dta")
#Declare our data to be a panel data set
crime.p <- pdata.frame(crime,index=c("county","year"))
crime.p %>% head
panel1 <- plm(crmrte ~ polpc + prbconv + avgsen + density, data = crime.p, model = "within")
panel2 <- plm(crmrte ~ polpc + prbconv + avgsen + density, data = crime.p, model = "random")
panel3 <- plm(crmrte ~ polpc + prbconv + avgsen + density, data = crime.p, model = "pooling")
panel4 <- rq(crmrte ~ polpc + prbconv + avgsen + density + factor(county)-1, data = crime.p)
panel5 <- lmer(crmrte ~ polpc + prbconv + avgsen + density + (1 | county), data = crime.p)
broom::augment(panel1)
broom::augment(panel2)
broom::augment(panel4)
# This one is very slow
broom::augment(panel5)
Here it is only somewhat slow, but with bigger datasets it can take up to 20 minutes. Is there a way to retrieve the output that broom::augment(mixed) gives, but much faster?
Upvotes: 2
Views: 353
Reputation: 3986
broom::augment(panel5) isn't particularly slow on my machine (~0.01 seconds). You can see where most of the time is taken by calculating each column separately and timing each piece:
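For instance, a minimal sketch that times each ingredient on its own (exact numbers will vary by machine):
microbenchmark::microbenchmark(
  fitted = predict(panel5),
  resid  = resid(panel5),
  hat    = hatvalues(panel5),
  cooksd = cooks.distance(panel5),
  times = 20
)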
I created a custom function to calculate each of the elements of the augment result:
fe <- function(panel){
  # Fixed-effects part of the prediction: intercept + sum of column * coefficient
  f <- fixef(panel)
  int <- f[1]
  coef <- f[-1]
  d <- panel@frame      # use the function argument, not the global panel5
  d1 <- d[names(coef)]  # model-frame columns matching the slope coefficients
  int + rowSums(mapply(`*`, d1, coef))
}
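As a quick sanity check (assuming the model above), the fixed-effects-only prediction should agree with predict() when the random effects are switched off:
all.equal(unname(fe(panel5)), unname(predict(panel5, re.form = NA)))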
augment2 <- function(panel) {
  d <- panel@frame
  p <- predict(panel)            # fitted values, including random effects
  r <- resid(panel)
  hat <- hatvalues(panel)        # leverage
  cook <- cooks.distance(panel)  # Cook's distance
  fe <- fe(panel)                # fixed-effects-only prediction
  mu <- panel@resp$mu            # fitted mean of the response
  wr <- panel@resp$wtres         # weighted residuals
  cbind(d, fitted = p, residual = r, hat, cooksd = cook, fixed = fe, mu, wtres = wr)
}
If we time the two, they are pretty similar, since all I've done is calculate the same things that augment calculates:
microbenchmark::microbenchmark(
augment = broom::augment(panel5),
cust = augment2(panel5)
)
#Unit: milliseconds
# expr min lq mean median uq max neval
# augment 12.332313 12.85186 16.83977 13.31599 15.69942 59.84045 100
# cust 9.976812 10.38016 11.94285 10.66314 11.32606 43.50708 100
If we take out each of the columns in turn from our custom augment2 function, we can see that most of the time is taken in calculating hatvalues and cooks.distance. If you don't really need one or both of these columns, dropping them will speed up the calculation significantly. Excluding both, we see a roughly 25x increase in speed:
augment3 <- function(panel) {
  d <- panel@frame
  p <- predict(panel)
  r <- resid(panel)
  # hatvalues() and cooks.distance() are the expensive parts, so skip them
  fe <- fe(panel)
  mu <- panel@resp$mu
  wr <- panel@resp$wtres
  cbind(d, fitted = p, residual = r, fixed = fe, mu, wtres = wr)
}
microbenchmark::microbenchmark(
augment = broom::augment(panel5),
cust = augment3(panel5)
)
#Unit: microseconds
# expr min lq mean median uq max neval
# augment 12549.486 12924.666 17760.754 13341.712 15480.9555 87276.359 100
# cust 406.818 474.119 694.698 532.005 586.5685 4702.398 100
Basically, the time taken is due to cooks.distance and hatvalues, which I can't imagine could be calculated much more efficiently than they already are.
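If you only occasionally need the diagnostics, one option (a sketch building on augment3 above; augment4 is just an illustrative name) is to make the expensive columns opt-in:
augment4 <- function(panel, diagnostics = FALSE) {
  out <- augment3(panel)
  if (diagnostics) {
    # only pay for the slow leverage/influence calculations when asked
    out$hat <- hatvalues(panel)
    out$cooksd <- cooks.distance(panel)
  }
  out
}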
Upvotes: 3