Brennan

Reputation: 429

How to run a for loop to run regressions by dummy variables

I have the following code:

reg <- lm(Y ~ x1 + x1_sq + x2 + x2_sq + x1x2 + d2 + d3 + d4, df)

Where all x_i are continuous variables and d_i are mutually exclusive dummy variables (d1 is present but excluded to avoid perfect multicollinearity). Rather than including the dummy variables, I want to run separate regressions for each dummy variable == 1. I wish to achieve this through a loop in the following form:

dummylist <- list("d1", "d2", "d3", "d4")
for(i in dummylist){
   if(i==1){
      ireg <- lm(Y ~ x1 + x1_sq + x2 + x2_sq + x1x2, df)
   } else {
      Unsure what to put here
   }
}

My three(?) questions are:

  1. in the first section of the -if- function, do I just include "i" before "reg" for my code to generate results "d1reg, d2reg, etc."? and,
  2. included in the code above, what would I put after the -else- statement?
  3. This all begs the question, is putting an if-else statement within the -for- loop the wrong approach/is there a more appropriate loop?

Sorry if this is too much, please let me know if it is and I can cut it down or separate into multiple questions. I could not find a similar question, probably as I am rather new to running loops in R and don't know what to look for.

Upvotes: 0

Views: 457

Answers (1)

Oliver

Reputation: 8582

  1. in the first section of the -if- function, do I just include "i" before "reg" for my code to generate results "d1reg, d2reg, etc."?

Short answer: no.

In R there are many data types. One of the more versatile ones is the list, which can store any type of object, including fitted models. Alternatively, you could create an environment to store the models in, but that is overkill here.

If you know roughly how many elements should be in your list, the easiest approach is to initialize it before the loop:

n <- 3
regList <- vector(mode = "list", length = n)
# Optional naming:
#names(regList) <- c("d1 reg", "d2 reg", "d3 reg")

In your loop you then fill in your list iteratively:

for(i in seq_along(regList)){
   regList[[i]] <- lm(...)
}
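
As a minimal sketch of this pattern on made-up data (the data frame df_demo, its columns, and the slot names are hypothetical and not taken from the question), the list can be pre-allocated, filled, and then queried by name or position:

```r
# Hypothetical toy data; none of these values come from the question.
set.seed(1)
df_demo <- data.frame(x1 = rnorm(60), g = rep(1:3, each = 20))
df_demo$Y <- 1 + 2 * df_demo$x1 + rnorm(60)

# Pre-allocate a list with one named slot per model.
regList <- vector(mode = "list", length = 3)
names(regList) <- c("d1reg", "d2reg", "d3reg")

# Fill the slots one at a time; each slot holds a complete lm object.
for (i in seq_along(regList)) {
  regList[[i]] <- lm(Y ~ x1, data = df_demo[df_demo$g == i, ])
}

# Retrieve a model by name or by position.
coef(regList[["d2reg"]])
summary(regList[[3]])$r.squared
```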

  2. what would I put after the -else- statement?

It is not entirely clear what you want here. One possibility is that you want to add the separate dummy variables to the model one after another. For this, the simplest approach is probably to save your formula and update it iteratively.

form <- Y ~ x1 + x1_sq + x2 + x2_sq + x1x2
for(i in seq_along(regList)){
   # paste0 combines strings. ". ~ . + d1" means: take the current formula and add the term d1.
   # form is overwritten on each pass, so the dummies accumulate (d1, then d1 + d2, ...).
   form <- update(form, as.formula(paste0(". ~ . + d", i)))
   regList[[i]] <- lm(form, data = df)
}
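
The loop above overwrites form on every pass, so each model keeps the dummies added before it. If the intent is instead one dummy per model, a possible sketch (the names base_form, dummies, and forms are made up for illustration) keeps the base formula untouched and builds a separate formula for each dummy:

```r
# Base formula from the question; it is never modified below.
base_form <- Y ~ x1 + x1_sq + x2 + x2_sq + x1x2
dummies <- c("d2", "d3", "d4")

# update() returns a new formula each time, so nothing accumulates.
forms <- lapply(dummies, function(dname) {
  update(base_form, as.formula(paste(". ~ . +", dname)))
})
names(forms) <- dummies

# Inspect one of the generated formulas.
forms[["d3"]]
```

Each element of forms can then be passed to lm() together with the data frame.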

Or maybe you are actually trying to run separate regressions on the subset where each dummy equals 1. This can be done with lm itself:

form <- Y ~ x1 + x1_sq + x2 + x2_sq + x1x2
d <- list(d1, d2, d3)
for(i in seq_along(regList)){
   #Using the subset argument
   regList[[i]] <- lm(form, data = df, subset = which(d[[i]] == 1))
   #Alternatively:
   #regList[[i]] <- lm(form, data = subset(df, d[[i]] == 1))
}

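Disclaimer: the loop above assumes that d1, d2, d3 exist as standalone vectors in your workspace. If they are instead columns of df, one option is to pick the column by name and subset the data frame inside the loop:

   regList[[i]] <- lm(form, data = df[df[[paste0("d", i)]] == 1, ])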
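
Putting the pieces together, here is a hedged end-to-end sketch for the case where the dummies are columns of the data frame. The data are simulated, df_demo and regs are hypothetical names, and the squares and the interaction are written in the formula rather than as pre-computed columns, which is an assumption about how the question's variables were built:

```r
# Simulated data with four mutually exclusive dummies (illustration only).
set.seed(42)
n <- 200
grp <- sample(1:4, n, replace = TRUE)
df_demo <- data.frame(
  x1 = rnorm(n), x2 = rnorm(n),
  d1 = as.integer(grp == 1), d2 = as.integer(grp == 2),
  d3 = as.integer(grp == 3), d4 = as.integer(grp == 4)
)
df_demo$Y <- 1 + df_demo$x1 - 0.5 * df_demo$x2 + rnorm(n)

form <- Y ~ x1 + I(x1^2) + x2 + I(x2^2) + x1:x2
dummies <- c("d1", "d2", "d3", "d4")

# One regression per dummy, fitted only on the rows where that dummy is 1.
regs <- lapply(dummies, function(dname) {
  lm(form, data = df_demo[df_demo[[dname]] == 1, ])
})
names(regs) <- paste0(dummies, "reg")

# Number of observations used by each model.
sapply(regs, nobs)
```

No if-else is needed: d1 is handled the same way as the other dummies.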

  3. is putting an if-else statement within the -for- loop the wrong approach/is there a more appropriate loop?

An if-else statement inside a loop is not wrong in general, but here it does not serve a clear purpose: a loop over the dummies (or over the list positions) does the job on its own. Also note that i in dummylist yields the strings "d1", "d2", "d3", "d4", because the names were quoted when the list was built, so the test i == 1 is never TRUE.
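
A small sketch of what the loop variable holds, using the dummylist from the question (the helper name never_true is made up for illustration):

```r
dummylist <- list("d1", "d2", "d3", "d4")

for (i in dummylist) {
  # i is a string such as "d1"; comparing it with the number 1 is FALSE.
  cat(i, "== 1 is", i == 1, "\n")
}

# Collect the comparison for every element to confirm it never succeeds.
never_true <- vapply(dummylist, function(i) i == 1, logical(1))
any(never_true)

# Looping over positions gives a numeric index and the name together.
for (k in seq_along(dummylist)) {
  cat("position", k, "holds", dummylist[[k]], "\n")
}
```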

Another thing to address is whether you transformed the variables yourself before running the regression. R's formula interface lets you do this directly in the formula, and doing so helps you avoid easy mistakes, such as testing a lower-order term while a higher-order term that contains it is still in the model (unless that is really what you want). For example, I assume x1_sq = x1^2. Maybe d1, d2, d3 are all contained in a single variable d? In these cases you should use the original variables, as shown below:

lm(formula = Y ~ poly(x1, 2, raw = TRUE) + poly(x2, 2, raw = TRUE) + x1:x2, data = df ) #+d if d1, d2, d3 is part of the formula

Here poly(x1, 2, raw = TRUE) is the second-order polynomial in x1, and raw = TRUE returns the terms as x1 and x1^2 (equivalent to x1 + I(x1^2)) rather than the orthogonal representation.

If you do this, drop1, anova, etc. treat each poly() term as a single term, so they will not test the first-order part of a variable separately from its second-order part.
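
To illustrate, here is a sketch on simulated data (df_demo, fit_manual, and fit_formula are hypothetical names, and d is assumed to be a factor with four levels) comparing hand-made transforms with the formula version:

```r
# Simulated data with a single factor d instead of separate dummies.
set.seed(7)
n <- 120
df_demo <- data.frame(
  x1 = rnorm(n), x2 = rnorm(n),
  d = factor(sample(c("d1", "d2", "d3", "d4"), n, replace = TRUE))
)
df_demo$Y <- 1 + df_demo$x1 + 0.5 * df_demo$x1^2 - df_demo$x2 + rnorm(n)

# Hand-made transforms, as in the question.
df_demo$x1_sq <- df_demo$x1^2
df_demo$x2_sq <- df_demo$x2^2
df_demo$x1x2 <- df_demo$x1 * df_demo$x2
fit_manual <- lm(Y ~ x1 + x1_sq + x2 + x2_sq + x1x2 + d, data = df_demo)

# Same model with the transforms written in the formula.
fit_formula <- lm(Y ~ poly(x1, 2, raw = TRUE) + poly(x2, 2, raw = TRUE) + x1:x2 + d,
                  data = df_demo)

# Both parameterizations describe the same fit, so fitted values agree.
same_fit <- all.equal(unname(fitted(fit_manual)), unname(fitted(fit_formula)))
same_fit

# Each poly() term is tested as a whole rather than coefficient by coefficient.
drop1(fit_formula, test = "F")
```

The coefficients come out in a different order in the two fits, which is why the comparison uses fitted values rather than coef().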

Upvotes: 1
