Reputation: 233
I generated some data in R
n <- 1000; p <- 30
X <- matrix(rnorm(n*p), nrow = n, ncol = p)
beta <- c(rep(1, 10), rep(0, 10), rep(-2, 10))
y <- X %*% beta + rnorm(1000)
Next, I want to run a stepwise regression of y on the columns of X, from 1 to 30. First I only include the intercept, then only intercept and column one, then add column two, column three, and so on. I wrote the following code
model <- lm(y~1)
for(i in 1:30){
model <- update(model, ~.+X[, i])
print(model)
}
What I see in the output now is that for each iteration, the regression is of y on an intercept and X[, i]
, i.e. the i-th column of X, and not the previous columns, even though I'm updating at every step. For example, when i = 4, the model is a regression of y
on an intercept and X[, 4]
, not all of columns 1, 2, 3, 4. Why does this happen?
Upvotes: 3
Views: 664
Reputation: 694
The reason your proposed code doesn't work is because of how R sees the formula and the fact that R updates the formula before it evaluates i
.
The source code for the relevant update method can be viewed by running update.default
at the command line. You'll see that after some error checking it runs call$formula <- update(formula(object), formula.)
, which calls the update.formula()
function. update.formula()
sees that you want to add the term X[, i]
into the formula and does that. But update.formula()
doesn't evaluate the value of i
at this point, it relies on "lazy evaluation". This can be seen more clearly if we expand out the loop.
form <- y ~ 1
form
#> y ~ 1
i <- 1
form <- update.formula(form, ~. +X[, i])
form
#> y ~ X[, i]
i <- 2
form <- update.formula(form, ~. +X[, i])
form
#> y ~ X[, i]
The formula is being updated with the symbol X[, i]
and then simplified to remove the duplicate symbol. This lazy evaluation is useful because it means that I don't need to actually define what X
of y
are for the above code to run. R trusts that I'll create appropriate objects before I try to use them.
After update()
has updated the formula, it eval()
's the updated call. At this time i
is evaluated and its current value is used. So in fact, this loop below gives the exact same output as your loop even though it doesn't try to change the formula at all. Each time lm()
runs it looks for the current value of i
to use.
for(i in 1:30){
model <- lm(y ~ X[, i])
print(model)
}
To achieve your desired effect you can programmatically create the formula outside the lm()
function, not using an update()
function. Like so,
n <- 1000; p <- 30
X <- matrix(rnorm(n*p), nrow = n, ncol = p)
beta <- c(rep(1, 10), rep(0, 10), rep(-2, 10))
y <- X %*% beta + rnorm(1000)
xnames <- sapply(list(1:ncol(X)), function(x) paste0("X",x))
colnames(X) <- xnames
dat <- data.frame(y,X)
for(i in 1:30){
form <- as.formula(paste0("y ~ ", paste(xnames[1:i], collapse = "+")))
model <- lm(form, data = dat)
print(model)
}
EDIT:
After reading this post, https://notstatschat.rbind.io/2022/06/23/getting-strings-into-code-in-base-r/, an alternate way to perform the formula manipulations is to use bquote()
. This has the advantage that the model summary contains the correct formula.
for(i in 1:30){
model <- eval(bquote(update(model, ~. + .(as.name(xnames[[i]])))))
print(model)
}
Upvotes: 1
Reputation: 2485
Try this
model <- lm(y~1)
for(i in 1:30){
model <- update(model, ~.+X[, 1:i])
print(model)
}
Upvotes: 1