user

Reputation: 7

For loops: Running through column names

I am looking for a shorter way to write this, possibly with a for loop.

That is, i would run from 1 to 22, adding each of the 22 predictor columns to the multiple regression:

reg <- lm(log(y) ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+z1+z2+z3+z4+z5+z6+z7+z8+z9+z10+z11+z12, data)

To clarify, x1, x2 and x3 are all column names (x2 is "x two", not x squared). I am trying to run a multiple regression on the last 22 columns of my data set.

Someone suggested this:

reg1 <- lm(log(data$y)~terms( as.formula( 
  paste(" ~ (", paste0("X", 29:ncol(data) , collapse="+"), ")")
)         
))

But:

  1. It doesn't work.
  2. I don't think it does a multiple regression (xone + xtwo + xthree); rather, it seems to assign the binary value 1 to each variable x1, x2, x3, ... and add them, which is not what I want.

Upvotes: 0

Views: 198

Answers (2)

IRTFM

Reputation: 263441

I know that a for-loop was requested, but it would have been a clumsy strategy, so here is one strategy that works:

formchr <- paste(
  paste("log(y)", paste0("x", 1:10, collapse = "+"), sep = "~"),  # the LHS and first 10 terms
  paste0("z", 1:12, collapse = "+"),                               # next 12 terms
  sep = "+")                                                       # put both parts together
reg1 <- lm(as.formula(formchr), data = data)

The full character version of the formula should be passed to the as.formula function. Because the paste and paste0 functions are fully vectorized, no loop is needed.
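To check what as.formula actually receives, it can help to print the pasted string before fitting. A minimal sketch, assuming the same x1 to x10 and z1 to z12 column names as in the question; no data are needed for this step, and the names form and vars are just illustrative:

```r
# Build the formula text the same way as above, then inspect it before fitting.
formchr <- paste(
  paste("log(y)", paste0("x", 1:10, collapse = "+"), sep = "~"),
  paste0("z", 1:12, collapse = "+"),
  sep = "+")
print(formchr)

# as.formula() parses the string; all.vars() lists the variables it found.
form <- as.formula(formchr)
vars <- all.vars(form)
print(vars)
```

If the printed string or variable list is not what you expect, the problem is in the paste calls rather than in lm.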

If the first 22 columns were the desired targets for the RHS terms, you could paste together names(data)[1:22], or names(data)[29:50] if those were the locations. That would be substituted for the RHS terms in the second paste above, dropping the third paste.
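A minimal sketch of that names()-based variant. The data frame dat and its simulated contents are made up for illustration (they are not the asker's data), and the predictors are assumed to sit in columns 29 to 50:

```r
# Hypothetical stand-in data: 50 noise columns (V1 ... V50) plus a positive
# response y, so that log(y) is defined.
set.seed(42)
dat <- as.data.frame(matrix(rnorm(100 * 50), nrow = 100))
dat$y <- exp(rnorm(100))

# Paste the column names at positions 29:50 into the right-hand side.
rhs <- paste(names(dat)[29:50], collapse = "+")
formchr2 <- paste("log(y)", rhs, sep = "~")

reg2 <- lm(as.formula(formchr2), data = dat)
length(coef(reg2))  # intercept plus one slope per selected column
```

Selecting by position keeps the code short, but it silently picks the wrong predictors if the column order changes, so selecting by name is safer when the names are known.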

The only reason I used data as the name of an object is that the question implied it. Using that name is confusing: data is an R function, and objects should have specific names that do not overlap with function names. The other very commonly abused name in this regard is df, which is the density function for the F distribution.

Upvotes: 1

Justin

Reputation: 1410

You could first subset your data into a data.frame which contains only the columns of interest. Then, you can run a linear model using the . formula syntax to select all columns other than the y variable.

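Example using 1000 rows and 50 columns of data:

N <- 1000
P <- 50
# 50 independent noise columns (V1 ... V50); rep() on a one-column data.frame
# would give 50 identical, perfectly collinear columns
data <- as.data.frame(matrix(rnorm(N * P), nrow = N))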

Assign your y data to y.

y <- data.frame(y = exp(rnorm(N)))  # strictly positive, so log(y) is defined

Create a new data.frame containing y and the last 22 columns.

model_data <- cbind(y, data[, 29:50])
colnames(model_data) <- c("y", paste0("x", 1:10), paste0("z", 1:12))

The following should do the trick. The . formula syntax will select all columns other than the y column.

reg <- lm(log(y) ~ ., data = model_data)
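To confirm which columns the dot picked up, you can look at the term labels stored in the fitted model. A small sketch on a made-up data frame (toy and its columns a, b, d are purely illustrative):

```r
# Toy data: a positive response y and three predictors.
set.seed(1)
toy <- data.frame(y = exp(rnorm(50)),
                  a = rnorm(50),
                  b = rnorm(50),
                  d = rnorm(50))

fit <- lm(log(y) ~ ., data = toy)

# The stored terms object has the dot expanded; y itself is not among the
# predictors because it already appears on the left-hand side.
labels <- attr(terms(fit), "term.labels")
print(labels)
```

The same check on the fitted reg object should list the 22 renamed columns, which is a quick way to catch a wrong subset before interpreting coefficients.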

Upvotes: 0
