Reputation: 2311
I wrote an algorithm that fits a linear model with lm() and then "updates" the response variable iteratively. The problem is that in a high-dimensional scenario, fitting the linear models becomes a bottleneck. On the other hand, most of the work is a matrix inversion that depends only on the covariate matrix X, i.e., the coefficients are given by solve(t(X) %*% X) %*% t(X) %*% y.
Reading the lm() code, I understand that R uses a QR decomposition. Is it possible to recover the internal matrix operations it performs and fit a new model with new y values faster?
Here's a minimal example:
set.seed(1)
X <- matrix(runif(400*150000), nrow = 150000)
y1 <- runif(150000)
y2 <- runif(150000)
mod1 <- lm(y1 ~ X)
mod2 <- lm(y2 ~ X)
In theory, mod2 repeats costly matrix operations identical to the ones performed in the first lm() call. I want to keep using lm() because of its efficient implementation and its ability to handle rank-deficient covariate matrices automatically.
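(As a small sanity check of that formula, separate from the full-size example above, the closed-form solution matches coef(lm()) once the intercept column that lm(y ~ X) adds is included explicitly; Xs and ys below are just a hypothetical toy design:)
# Closed-form least-squares coefficients vs. lm(), toy example only
set.seed(2)
Xs <- matrix(runif(20 * 3), nrow = 20)
ys <- runif(20)
Xi <- cbind(1, Xs)                                     # intercept column, as lm(ys ~ Xs) adds one
beta_direct <- solve(t(Xi) %*% Xi) %*% t(Xi) %*% ys    # normal equations
beta_lm <- coef(lm(ys ~ Xs))
all.equal(unname(drop(beta_direct)), unname(beta_lm))  # TRUE (up to numerical tolerance)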
Upvotes: 3
Views: 107
Reputation: 37794
Have you tried just fitting a multivariate model? I haven't checked the code, but on my system it takes roughly half the time of the two separate fits, so I wouldn't be surprised if it's doing what you suggest behind the scenes. That is,
mods <- lm(cbind(y1, y2) ~ X)
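If you go this route, the joint fit can be pulled apart by response afterwards; a small sketch (coef(), fitted() and resid() on a multivariate lm return one column per response):
b <- coef(mods)              # coefficient matrix with columns "y1" and "y2"
f1 <- fitted(mods)[, "y1"]   # fitted values for the first response
r2 <- resid(mods)[, "y2"]    # residuals for the second response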
Upvotes: 1
Reputation: 32558
# Data
set.seed(1)
n = 5
X <- matrix(runif(5*n), nrow = n)
y1 <- runif(n)
y2 <- runif(n)
# lm models
mod1 <- lm(y1 ~ X)
mod2 <- lm(y2 ~ X)
# Obtain the QR decomposition once, including the intercept column that lm(y ~ X) adds
q = qr(cbind(1, X))
# Reuse 'q' to obtain fitted values repeatedly
mod1_fv = qr.fitted(q, y1)
mod2_fv = qr.fitted(q, y2)
# Compare fitted values from reusing 'q' to fitted values in 'lm' models
Vectorize(all.equal)(unname(mod1$fitted.values), mod1_fv)
#> [1] TRUE TRUE TRUE TRUE TRUE
Vectorize(all.equal)(unname(mod2$fitted.values), mod2_fv)
#> [1] TRUE TRUE TRUE TRUE TRUE
Created on 2019-09-06 by the reprex package (v0.3.0)
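Though not shown above, the same reused decomposition can also return coefficients and residuals via base R's qr.coef() and qr.resid(); a sketch with the objects from this example:
mod1_coef = qr.coef(q, y1)   # coefficients (NA for any columns dropped due to rank deficiency)
mod1_res  = qr.resid(q, y1)  # residuals, i.e. y1 minus the fitted values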
Upvotes: 1