vijkrishb

Reputation: 23

Applying fixed effects factor in R breaks the regression

I am trying to run a fixed effects regression in R. The linear model runs just fine without the fixed effects factor, but when I add the factor, which is a numeric code for user ID, I get the following error:

Error in rep.int(c(1, numeric(n)), n - 1L) : cannot allocate vector of length 1055470143

I am not sure what the error means, but I suspect it may be an issue with how the variable is coded in R.

Upvotes: 2

Views: 1164

Answers (2)

Metrics

Reputation: 15458

I think this is more of a statistical problem than a programming one, for two reasons:

First, I am not sure whether you are using cross-sectional data or panel data. If you are using cross-sectional data, it doesn't make sense to control for 30,000 individuals (of course, they will add to the variation).

Second, if you are using panel data, there are good packages, such as the plm package in R, that do this kind of computation (see the sketch below).
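
As a rough sketch only (not tested against the asker's data; the data frame mydata and the columns id, time, y, and x are placeholders), a fixed-effects ("within") estimation with plm looks roughly like this:

library(plm)

# declare the panel structure: individual and time indices
pdat <- pdata.frame(mydata, index = c("id", "time"))

# "within" model = individual fixed effects
fit.fe <- plm(y ~ x, data = pdat, model = "within")
summary(fit.fe)

This avoids building the huge dummy-variable design matrix that lm would create for the user ID factor.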

Upvotes: 1

Roland

Reputation: 132696

An example:

set.seed(42)
DF <- data.frame(x=rnorm(1e5),id=factor(sample(seq_len(1e3),1e5,TRUE)))
DF$y <- 100*DF$x + 5 + rnorm(1e5,sd=0.01) + as.numeric(DF$id)^2

fit <- lm(y~x+id,data=DF)

This needs almost 2.5 GB of RAM for the R session (if you add the RAM needed by the OS, this is more than many PCs have available) and takes some time to finish. The result is pretty useless.

If you don't run into RAM limitations, you can still hit limits on vector length (e.g., if you have even more factor levels), in particular if you use an older version of R.

What happens?

One of the first steps in lm is creating the design matrix using the function model.matrix. Here is a smaller example of what happens with factors:

model.matrix(b~a,data=data.frame(a=factor(1:5),b=2))

#   (Intercept) a2 a3 a4 a5
# 1           1  0  0  0  0
# 2           1  1  0  0  0
# 3           1  0  1  0  0
# 4           1  0  0  1  0
# 5           1  0  0  0  1
# attr(,"assign")
# [1] 0 1 1 1 1
# attr(,"contrasts")
# attr(,"contrasts")$a
# [1] "contr.treatment"

See how n factor levels result in n-1 dummy variables? If you have many factor levels and many observations, this matrix gets huge.
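
As a rough back-of-the-envelope check (my own numbers, not from the output above): the dense design matrix for the example has 1e5 rows and about 1001 columns (intercept, x, and 999 dummies), each stored as an 8-byte double:

n_rows <- 1e5
n_cols <- 2 + (1000 - 1)        # intercept + x + (levels - 1) dummy columns
n_rows * n_cols * 8 / 1024^2    # ~764 MB for a single copy of the matrix

lm makes further copies of this matrix while fitting (e.g., for the QR decomposition), which is consistent with the roughly 2.5 GB observed above.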

What should you do?

I'm quite sure you should use a mixed-effects model. There are two important packages that implement linear mixed-effects models in R: the nlme package and the newer lme4 package.

library(lme4)

fit.mixed <- lmer(y~x+(1|id),data=DF)
summary(fit.mixed)

Linear mixed model fit by REML 
Formula: y ~ x + (1 | id) 
Data: DF 
    AIC     BIC  logLik deviance REMLdev
1025277 1025315 -512634  1025282 1025269
Random effects:
 Groups   Name        Variance   Std.Dev. 
 id       (Intercept) 8.9057e+08 29842.472
 Residual             1.3875e+03    37.249
Number of obs: 100000, groups: id, 1000

Fixed effects:
             Estimate Std. Error t value
(Intercept) 3.338e+05  9.437e+02   353.8
x           1.000e+02  1.180e-01   847.3

Correlation of Fixed Effects:
  (Intr)
x 0.000

This needs very little RAM, computes quickly, and is a more appropriate model.

See how the random intercept accounts for most of the variance?
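
If you want to inspect the estimated per-id intercepts (standard lme4 usage, not part of the output above):

fixef(fit.mixed)            # fixed-effect coefficients: (Intercept) and x
head(ranef(fit.mixed)$id)   # estimated random intercept for each id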

So, you need to study mixed-effects models. There are some nice publications, e.g. Baayen, Davidson & Bates (2008), that explain how to use lme4.

Upvotes: 0
