user3032689
user3032689

Reputation: 667

perform many regressions row by row

I have a matrix (5,000 x 5,000), which is my dependent variable over time, and I have several matrices of independent variables over time, in the same format. Both matrices contain NA's from time to time, so these have to be taken care of via na.exclude.

I made some sample data to illustrate my problem:

y <- matrix(rnorm(25000),5000,5000)
x1 <- matrix(rnorm(25000),5000,5000)
x2 <- matrix(rnorm(25000),5000,5000)
x3 <- matrix(rnorm(25000),5000,5000)
x4 <- matrix(rnorm(25000),5000,5000)
x5 <- matrix(rnorm(25000),5000,5000)
x6 <- matrix(rnorm(25000),5000,5000)

lx <- list()

test <- lapply(1:nrow(y), function(i){lm(y[i,]~x1[i,]+x2[i,]+x3[i,]+x4[i,]+x5[i,]+x6[i,],na.action="na.exclude")})

But note I do not have any NAs here (I don't know how I can sample NA data as well?). When I run the regressions with my real data, it takes up to 10 minutes. With this data here it is much quicker, probably because there are no NAs. 10 minutes is not all to long, but I would like to optimize the speed because I have to do it many times.

Q: Is there any way to run the regressions more quickly? In the end I especially need the coefficients of all the regressions, but also R^2 any maybe more information later on. I do not think I can avoid a loop if I want to perform regressions row by row. From what I read, the costly part here seems to be the generation of lm objects itself. Thanks for any hints!

Upvotes: 1

Views: 92

Answers (1)

nevrome
nevrome

Reputation: 1550

Maybe RcppArmadillo::fastLm() is ok for your purpose. It's a pretty simple implementation and maybe the regression algorithm does not perform good enough for your purpose. But it's pretty fast.

y <- matrix(rnorm(250000), 500, 500)
x1 <- matrix(rnorm(250000), 500, 500)
x2 <- matrix(rnorm(250000), 500, 500)
x3 <- matrix(rnorm(250000), 500, 500)
x4 <- matrix(rnorm(250000), 500, 500)
x5 <- matrix(rnorm(250000), 500, 500)
x6 <- matrix(rnorm(250000), 500, 500)

library(RcppArmadillo)
library(rbenchmark)

benchmark(
  lm = {lapply(
    1:nrow(y), 
    function(i){
      lm(
        y[i,]~x1[i,]+x2[i,]+x3[i,]+x4[i,]+x5[i,]+x6[i,],
        na.action = "na.exclude"
      )
    }
  )},
  fastLmPure = {lapply(
    1:nrow(y), 
    function(i){
      fastLmPure(
        X = as.matrix(data.frame(x1[i,], x2[i,], x3[i,], x4[i,], x5[i,], x6[i,])),
        y = y[i,]
      )
    }
  )},
  replications = 10
)

One random result:

        test replications elapsed relative user.self sys.self user.child sys.child
2 fastLmPure           10   4.690    1.000      4.69        0          0         0
1         lm           10  11.143    2.376     11.13        0          0         0

Upvotes: 1

Related Questions