MatthewR
MatthewR

Reputation: 2770

Optimize the time to run fixed effects in an R lm() model

I am trying to run a regression model that includes fixed effects for cities in the united states. I have over 10,000,000 million rows and 600 cities. The code below works, but it is really slow. When including a factor for a variable with lots of levels, is there any way to run the model faster.

x <- data.frame(
    a = sample( 1:1000, 1000000 , replace=T),
    cityfips = sample( 1:250, 1000000 , replace=T),
    d = sample( 1:4, 1000000 , replace=T)
)

system.time(a1 <- lm( a~cityfips+d  , x ) )
system.time(a2 <- lm( a~as.factor(cityfips) + d  , x ) )




> system.time(a1 <- lm( a~cityfips+d  , x ) )
   user  system elapsed 
   0.22    0.00    0.22 
> system.time(a2 <- lm( a~as.factor(cityfips) + d  , x ) )
   user  system elapsed 
  95.65    0.97   96.62 
> system.time(a3 <- slm( a~as.factor(cityfips) + d  , x ) )
   user  system elapsed 
   4.58    2.06    6.65 

Upvotes: 0

Views: 881

Answers (2)

my_cabbages
my_cabbages

Reputation: 1

check out the lfe package. I've not dug into the details of the algorithm but at least in my experience it's produced exactly the same results as lm() in fractions of the time.

as a bonus it makes it easy to cluster standard errors so you don't need to do any clustering and/or sandwich estimator business afterward, although the syntax for doing so is a little unusual

Upvotes: 0

StupidWolf
StupidWolf

Reputation: 46898

When you have that many factors, constructing the model.matrix in lm() will take up most of the time, one way is to use sparseMatrix like in glmnet and there are two packages, sparseM, MatrixModels that allows lm onto sparseMatrix:

set.seed(111)
x <- data.frame(
    a = sample( 1:1000, 1000000 , replace=T),
    cityfips = sample( 1:250, 1000000 , replace=T),
    d = sample( 1:4, 1000000 , replace=T)
)

library(SparseM)
library(MatrixModels)
library(Matrix)

system.time(f_lm <- lm( a~as.factor(cityfips) + d  , x )  )
   user  system elapsed 
 75.720   2.494  79.365  
system.time(f_sparseM <- slm(a~as.factor(cityfips) + d  , x ))
   user  system elapsed 
  5.373   3.952  10.646
system.time(f_modelMatrix <- glm4(a~as.factor(cityfips) + d  ,data=x,sparse=TRUE))
   user  system elapsed 
  1.878   0.335   2.219

The closest I can find is glm4 in MatrixModels, but you can see below the coefficients are the same as the fit using lm:

all.equal(as.numeric(f_sparseM$coefficients),as.numeric(f_lm$coefficients))
[1] TRUE
all.equal(as.numeric(f_lm$coefficients),as.numeric(coefficients(f_modelMatrix)))
[1] TRUE

One other option besides glm4 in MatrixModels is to use lm.fit (as pointed out by @BenBolker:

lm.fit(x=Matrix::sparse.model.matrix(~as.factor(cityfips) + d,data=x),y=x$a)

This gives you a list as like lm.fit() normally and you cannot apply functions such as summary() etc.

Authors of both package warn about it being experimental so there might still be some differences compared to stats::lm , take care to check.

Upvotes: 3

Related Questions