Jakob

Reputation: 1453

Good packages for large panels with many 'fixed effects' / individual specific intercepts

I have had very good experiences with plm() on small datasets like this one:

library("data.table")
library("plm")
set.seed(123)
smalldata <- data.table(id = rep(1:100, each = 10), x = rnorm(1000))
smalldata[, y := x + 0.3*x + (rnorm(1000)/100)]

plm(smalldata,
    formula = y ~ x,
    effect = c("individual"),
    model = c("within"),
    index = "id")

But if I have a larger dataset, I quickly run into RAM issues and things become very(!) slow:

set.seed(123)
largedata <- data.table(id = rep(1:100000, each = 10), x = rnorm(1000000))
largedata[, y := x + 0.3*x + (rnorm(1000000)/100)]

time_pre <- Sys.time()
plm(largedata,
    formula = y ~ x,
    effect = c("individual"),
    model = c("within"),
    index = "id")
time_post <- Sys.time()
time_post-time_pre

This already takes longer than a minute on my machine, and my real-life datasets are much larger than 1 million observations. If you want to try it out, just add a few 0's and wait until your RAM fills up and the code simply no longer runs through. Of course, the problem gets worse even faster as more levels of fixed effects are added (you can do that in plm using dummy regressions with as.factor(); see the sketch below), but it is already visible in this simple example.
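To make the dummy-regression remark concrete, here is a minimal sketch of how a second set of fixed effects could be added to plm(); the column fe2 is invented purely for illustration and is not part of the data above:

# purely illustrative second grouping variable, added to the small dataset:
smalldata[, fe2 := rep(1:10, times = 100)]

# the second set of fixed effects enters as explicit dummies via factor();
# with many levels this inflates the design matrix and hence memory use:
plm(smalldata,
    formula = y ~ x + factor(fe2),
    effect = c("individual"),
    model = c("within"),
    index = "id")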

This slowness is also odd because the plm() documentation tells me that it uses within transformations, which should be fast. They must be doing this in a complicated way, because here is the same 'fixed effects' panel using a data.table() within transformation and then lm():

time_pre_dt <- Sys.time()
largedata[, x_demeaned := x - mean(x), by = id]
largedata[, y_demeaned := y - mean(y), by = id]
lm(largedata, formula = y_demeaned ~ -1 + x_demeaned)
time_post_dt <- Sys.time()
time_post_dt - time_pre_dt
as.double(time_post - time_pre, units = "secs")/as.double(time_post_dt - time_pre_dt, units = "secs")

On my machine, this is 70-125 times faster.

So the ingredients for much faster panel estimation are there, but plm() does not seem to take advantage of them. Are there other panel packages that are more useful for large 'fixed effects' panels? I am hesitant about workarounds (like this within transformation) because I am not always sure how standard errors are calculated in the different variants of coeftest() etc. (the same problem exists in Stata with its small-sample adjustments etc.). I'm looking for mature packages and would like to avoid coding everything from scratch.
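To illustrate what I mean by the standard-error concern, here is a minimal sketch on the small dataset; fe_mod is just an illustrative object name, and the clustered variance uses plm's vcovHC() together with coeftest() from lmtest:

library("lmtest")

fe_mod <- plm(smalldata,
              formula = y ~ x,
              effect = c("individual"),
              model = c("within"),
              index = "id")

coeftest(fe_mod)                                           # conventional SEs
coeftest(fe_mod, vcov = vcovHC(fe_mod, cluster = "group")) # clustered by id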

edit:

For benchmarking against @Helix123's answer below, here is the attempt with plm()'s 'fast' mode:

install.packages("collapse")
install.packages("fixest")
install.packages("lfe")

# (restart R session)

options("plm.fast" = TRUE)

time_pre_fast <- Sys.time()
plm(largedata,
    formula = y ~ x,
    effect = c("individual"),
    model = c("within"),
    index = "id")
time_post_fast <- Sys.time()

time_post_fast - time_pre_fast

as.double(time_post_fast - time_pre_fast, units = "secs")/as.double(time_post_dt - time_pre_dt, units = "secs")

The within-transformation example is still 90 times faster on my machine. However, felm() does speed things up quite considerably! Here is the example:

library("lfe")

time_pre_felm <- Sys.time()
felm(largedata, formula = y ~ x | id)
time_post_felm <- Sys.time()

as.double(time_post_felm - time_pre_felm, units = "secs")/as.double(time_post_dt - time_pre_dt, units = "secs")

This takes only about twice as long as the data.table() within transformation followed by lm().
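If you want standard errors clustered by id from felm(), the cluster variables go into the fourth part of its formula; a sketch (not benchmarked):

felm_fit <- felm(y ~ x | id | 0 | id, data = largedata)
summary(felm_fit)  # reports standard errors clustered by id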

feols() from the fixest package is the winner, see answer below.

Upvotes: 0

Views: 851

Answers (2)

Helix123

Reputation: 3687

edit: plm version 2.6-0 has the fast mode enabled by default. A further speed-up for plm is gained if package fixest (or lfe) is also installed. For the benchmarked example in ?plm.fast, this gives up to a 28x speed-up.

old answer: Since version 2.4-0 of plm, there is an optional fast mode; see the NEWS entries for 2.4-0 and 2.4-2: https://cran.r-project.org/web/packages/plm/news/news.html

To enable the fast mode, set this option:

options("plm.fast" = TRUE)

You can set this option manually, in your script, or in your .Rprofile file.

You will need to have package collapse installed as well; optionally, packages fixest or lfe give a further speed-up in some cases, especially two-way fixed effects from plm 2.4-2 onwards.
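A minimal sketch of such a setup, e.g. at the top of a script or in .Rprofile (the requireNamespace() calls only check that the backend packages are available):

options("plm.fast" = TRUE)
stopifnot(requireNamespace("collapse", quietly = TRUE))  # required for the fast mode
requireNamespace("fixest", quietly = TRUE)               # optional further speed-up
requireNamespace("lfe", quietly = TRUE)                  # optional further speed-up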

For a benchmarked example, see the help for ?plm.fast.

In your example, much time is spent creating a pdata.frame (the data format plm uses); this happens implicitly in the call to plm() if the data is not converted explicitly first. One can split the conversion to pdata.frame from the model estimation, so that the execution time of plm() itself becomes significantly smaller. This also speeds up any further model estimation on the same data set after the first one.

# convert data set to a pdata.frame:
plargedata <- pdata.frame(largedata, index = "id")

# use the pdata.frame for model estimation:
plm(plargedata,
     formula = y ~ x,
     effect = c("individual"),
     model = c("within"))

You can also have a look at packages fixest and lfe for fixed-effects estimation; they have a slightly different syntax. If you do not want to change your code, plm's fast mode should get you there. However, if you have many more than two fixed effects, you would need to include them in plm's formula with + factor(fe3) + factor(fe4) + ..., and such cases will likely be faster with packages lfe and fixest if there are many fixed effects with many levels; see the sketch below.
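As a sketch of the syntax difference (fe3 and fe4 are placeholder column names taken from the formula fragment above, and dat is a placeholder data set):

# plm: additional fixed effects as dummy variables in the formula
plm(dat, formula = y ~ x + factor(fe3) + factor(fe4),
    effect = c("individual"), model = c("within"), index = "id")

# fixest: all fixed effects go after the |
library("fixest")
feols(y ~ x | id + fe3 + fe4, data = dat)

# lfe: same idea
library("lfe")
felm(y ~ x | id + fe3 + fe4, data = dat)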

Upvotes: 1

Jakob

Reputation: 1453

Based on @Helix123's hint to look into the fixest package, here is the answer:

library("fixest")

time_pre_fixest <- Sys.time()
feols(y ~ x | id, largedata)
time_post_fixest <- Sys.time()

as.double(time_post_fixest - time_pre_fixest, units = "secs")/as.double(time_post_dt - time_pre_dt, units = "secs")

Takes a tenth of the time of the fastest version in my question.
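Regarding the standard-error concern from my question: feols() clusters its standard errors on the first fixed effect by default, and the clustering can be made explicit in summary(); a sketch:

fe_fit <- feols(y ~ x | id, largedata)
summary(fe_fit, cluster = ~id)  # standard errors clustered by id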

edit: You can still run into RAM issues, of course, but they are much less likely.

Upvotes: 0
