Reputation: 2770
I am trying to run a regression model that includes fixed effects for cities in the united states. I have over 10,000,000 million rows and 600 cities. The code below works, but it is really slow. When including a factor for a variable with lots of levels, is there any way to run the model faster.
x <- data.frame(
a = sample( 1:1000, 1000000 , replace=T),
cityfips = sample( 1:250, 1000000 , replace=T),
d = sample( 1:4, 1000000 , replace=T)
)
system.time(a1 <- lm( a~cityfips+d , x ) )
system.time(a2 <- lm( a~as.factor(cityfips) + d , x ) )
> system.time(a1 <- lm( a~cityfips+d , x ) )
user system elapsed
0.22 0.00 0.22
> system.time(a2 <- lm( a~as.factor(cityfips) + d , x ) )
user system elapsed
95.65 0.97 96.62
> system.time(a3 <- slm( a~as.factor(cityfips) + d , x ) )
user system elapsed
4.58 2.06 6.65
Upvotes: 0
Views: 881
Reputation: 1
check out the lfe package. I've not dug into the details of the algorithm but at least in my experience it's produced exactly the same results as lm()
in fractions of the time.
as a bonus it makes it easy to cluster standard errors so you don't need to do any clustering and/or sandwich estimator business afterward, although the syntax for doing so is a little unusual
Upvotes: 0
Reputation: 46898
When you have that many factors, constructing the model.matrix in lm()
will take up most of the time, one way is to use sparseMatrix like in glmnet
and there are two packages, sparseM
, MatrixModels
that allows lm onto sparseMatrix:
set.seed(111)
x <- data.frame(
a = sample( 1:1000, 1000000 , replace=T),
cityfips = sample( 1:250, 1000000 , replace=T),
d = sample( 1:4, 1000000 , replace=T)
)
library(SparseM)
library(MatrixModels)
library(Matrix)
system.time(f_lm <- lm( a~as.factor(cityfips) + d , x ) )
user system elapsed
75.720 2.494 79.365
system.time(f_sparseM <- slm(a~as.factor(cityfips) + d , x ))
user system elapsed
5.373 3.952 10.646
system.time(f_modelMatrix <- glm4(a~as.factor(cityfips) + d ,data=x,sparse=TRUE))
user system elapsed
1.878 0.335 2.219
The closest I can find is glm4 in MatrixModels
, but you can see below the coefficients are the same as the fit using lm:
all.equal(as.numeric(f_sparseM$coefficients),as.numeric(f_lm$coefficients))
[1] TRUE
all.equal(as.numeric(f_lm$coefficients),as.numeric(coefficients(f_modelMatrix)))
[1] TRUE
One other option besides glm4
in MatrixModels
is to use lm.fit (as pointed out by @BenBolker:
lm.fit(x=Matrix::sparse.model.matrix(~as.factor(cityfips) + d,data=x),y=x$a)
This gives you a list as like lm.fit()
normally and you cannot apply functions such as summary()
etc.
Authors of both package warn about it being experimental so there might still be some differences compared to stats::lm
, take care to check.
Upvotes: 3