Record linear regression results repeatedly

Question

As the following example shows, I want to run a regression many times and have R record the estimates for did from each run in a single data.frame.

Each run changes the year condition in ifelse, e.g. ifelse(mydata$year >= 1993, 1, 0), so each run fits a different regression. For example:

mydata$time = ifelse(mydata$year >= 1994, 1, 0)

Can anyone help? My basic code is below (if R returns an error when reading the URL, the data can be downloaded through a browser):

library(foreign)
mydata = read.dta("http://dss.princeton.edu/training/Panel101.dta")
mydata$time = ifelse(mydata$year >= 1994, 1, 0)
mydata$treated = ifelse(mydata$country == "E" | mydata$country == "F" | mydata$country == "G", 1, 0)
mydata$did = mydata$time * mydata$treated
didreg = lm(y ~ treated + time + did, data = mydata)
summary(didreg)

Zheyuan Li · Accepted Answer

Generally, if you want to repeat a process many times with a different input each time, you need a function. The following function takes a scalar year_value as its input, creates local variables for the regression, and returns the estimates for the model term did.

foo <- function (year_value) {
  ## create local variables from `mydata`
  y <- mydata$y
  treated <- as.numeric(mydata$country %in% c("E", "F", "G"))  ## use `%in%`
  time <- as.numeric(mydata$year >= year_value)  ## use `year_value`
  did <- time * treated
  ## run regression using local variables
  didreg <- lm(y ~ treated + time + did)
  ## return estimate for model term `did`
  coef(summary(didreg))["did", ]
}

foo(1993)
#     Estimate    Std. Error       t value      Pr(>|t|) 
#-2.784222e+09  1.504349e+09 -1.850782e+00  6.867661e-02

Note that your original code can be improved in several places: for example, use "%in%" instead of multiple "|", and use as.numeric instead of ifelse to coerce logical to numeric.
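To illustrate the two suggestions, here is a minimal sketch on a made-up vector (the name country and its values are hypothetical, not taken from the Panel101 data). It checks that both ways of building the indicator give the same result when there are no missing values.

```r
## toy input, invented for illustration only
country <- c("A", "E", "F", "B", "G")

## original style: chained `|` inside ifelse
treated_ifelse <- ifelse(country == "E" | country == "F" | country == "G", 1, 0)

## suggested style: `%in%` plus as.numeric
treated_in <- as.numeric(country %in% c("E", "F", "G"))

## both are plain numeric 0/1 vectors with the same values
identical(treated_ifelse, treated_in)
```

One difference to keep in mind: if country contains NA, the ifelse version yields NA for that element, while "%in%" yields FALSE and therefore 0.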

Now you need something like a loop to apply this function to several different values of year_value. I would use lapply.

## raw list of result from `lapply`
year_of_choice <- 1993:1994  ## taken for example
result <- lapply(year_of_choice, foo)

## rbind them into a matrix, then attach the years in a data.frame
data.frame(year = year_of_choice, do.call("rbind", result), check.names = FALSE)
#  year    Estimate Std. Error   t value   Pr(>|t|)
#1 1993 -2784221881 1504348732 -1.850782 0.06867661
#2 1994 -2519511630 1455676087 -1.730819 0.08815711
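If you want to try the same pattern without downloading the data, the following is a rough sketch on simulated data. The data frame sim and the function did_estimate are hypothetical names, the response is pure noise, and the function takes the data as an argument instead of reading a global mydata; that last point is a design variation, not part of the answer above.

```r
set.seed(1)

## simulated panel: 7 countries x 10 years, response is random noise
sim <- data.frame(
  country = rep(LETTERS[1:7], each = 10),
  year    = rep(1990:1999, times = 7)
)
sim$y <- rnorm(nrow(sim))

## same logic as `foo`, but the data frame is passed in explicitly
did_estimate <- function(year_value, data) {
  y       <- data$y
  treated <- as.numeric(data$country %in% c("E", "F", "G"))
  time    <- as.numeric(data$year >= year_value)
  did     <- time * treated
  fit     <- lm(y ~ treated + time + did)
  coef(summary(fit))["did", ]
}

## skip the minimum year (1990) so that `time` is not constant
years  <- 1991:1999
result <- lapply(years, did_estimate, data = sim)

## one row per year, one column per statistic
out <- data.frame(year = years, do.call("rbind", result), check.names = FALSE)
out
```

Extra arguments given to lapply after the function (here data = sim) are passed on to every call, so the same function can be reused with another data frame.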

Note: don't include 1990 (the minimum of the variable year) as a choice; otherwise time will be a vector of 1s, identical to the intercept. The resulting model is rank-deficient, and you will get a "subscript out of bounds" error. Since R 3.5.0, the generic function coef has a new complete argument, so for stability we may use

coef(summary(didreg), complete = TRUE)["did", ]

You should then see all NA or NaN for year 1990.
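The rank-deficient case can be reproduced on a small made-up data set. The sketch below is an alternative guard that does not rely on the complete argument: it checks whether the did row exists before indexing. The data frame toy and the helper safe_did_row are hypothetical names introduced for this example.

```r
set.seed(1)

## toy data where every row is "post", as happens with the minimum year
toy <- data.frame(treated = rep(c(0, 1), each = 10))
toy$time <- 1                        # constant, same as the intercept
toy$did  <- toy$time * toy$treated   # therefore identical to `treated`
toy$y    <- rnorm(20)

fit <- lm(y ~ treated + time + did, data = toy)

## (Intercept) and treated are estimated; time and did are NA
coef(fit)

## return the `did` row if present, otherwise NAs with the same names
safe_did_row <- function(fit) {
  cf <- summary(fit)$coefficients
  if ("did" %in% rownames(cf)) {
    cf["did", ]
  } else {
    setNames(rep(NA_real_, ncol(cf)), colnames(cf))
  }
}

safe_did_row(fit)
```

Returning a named NA vector keeps every element of the lapply result the same length, so the later rbind step still produces a regular table.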
