Ryan da Silva

Reputation: 23

Looping regressions and running column sum based on results

I have a data frame with panel data that looks as follows:

  countrycode year   7111   7112   7119   7126   7129    7131    7132   7133    7138
1         AGO 1981 380491 149890 238832      0 166690  449982  710642 430481  890546
2         AGO 1982 339626  66434 183487      0  79682  108356  486799 186884  220545
3         AGO 1983 128043   2697  91404 148617   3988  432725  829958 138764  152822
4         AGO 1984  67832      0  85613   1251  45644  361733 1250272 237236 2952746
5         AGO 1985 354335  11225 143000   2130   7687 2204297  942071 408907  474666

There are 159 four-digit column variables like the ones shown above. There are also column variables named CEPI1_fw and CIPI1_fw. Furthermore, there are 46 countries and 34 years in the data set.

I would like to use the plm command to regress each of the numerical column variables on CEPI1_fw and CIPI1_fw. Then, I would like to sum the numerical column variables in the data frame above based on whether the coefficients from the regressions are above or below a certain threshold. The resulting output should be a pair of columns added to the data frame above.

Upvotes: 1

Views: 377

Answers (1)

Ben Bolker

Reputation: 226057

There are a few ambiguities in your question, but I'll take a shot.

First, I'm going to revamp your code slightly: adding rows to data frames is very inefficient (probably doesn't matter in this application, but it's a bad habit to get into ...)

library(plm)

out <- list()
for (i in colnames(master5)) {
  ## column names such as 7111 are non-syntactic, so backtick-quote them
  f <- reformulate(c("CEPI1_fw", "CIPI1_fw"),
                   response = paste0("master5$`", i, "`"))
  m <- summary(plm(f, data = master4, model = "within"))
  ## row 1 = CEPI1_fw; column 1 = estimate, column 4 = p-value
  out <- c(out, list(data.frame(yvar = i, coef = m$coefficients[1, 1],
                                pval = m$coefficients[1, 4],
                                stringsAsFactors = FALSE)))
}
out <- do.call(rbind, out)  ## combine elements into a single data frame

Select only statistically significant response variables. From a statistical/inferential point of view, this is probably a bad idea ...

out <- out[out$pval<0.05,]

Select the names of the variables whose coefficients exceed a threshold in absolute value (threshold is a cutoff you choose yourself):

big_vars <- out$yvar[abs(out$coef)>threshold]

Compute column sums from another data set ...

colSums(other_data[big_vars])
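
If what you actually want is the pair of columns described in the question (one total per country-year row, rather than one total per variable), rowSums() may be closer to it than colSums(). Here is a minimal sketch on made-up data: toy, the coefficient values, the 0.5 cutoff and the names small_vars, sum_big and sum_small are all invented for illustration, and out stands in for the regression summary built in the loop.

```r
# Toy data standing in for the real panel (all names and values are made up)
toy <- data.frame(countrycode = c("AGO", "AGO", "AGO"),
                  year = 1981:1983,
                  `7111` = c(1, 2, 3),
                  `7112` = c(10, 20, 30),
                  `7119` = c(100, 200, 300),
                  check.names = FALSE)  # keep the non-syntactic column names

# Stand-in for the regression summary (coefficients are invented)
out <- data.frame(yvar = c("7111", "7112", "7119"),
                  coef = c(0.8, -0.1, 1.5),
                  stringsAsFactors = FALSE)
threshold <- 0.5  # hypothetical cutoff

# Split the response variables by coefficient size
big_vars   <- out$yvar[abs(out$coef) >  threshold]
small_vars <- out$yvar[abs(out$coef) <= threshold]

# rowSums() returns one total per row, so the results can be attached as columns
toy$sum_big   <- rowSums(toy[big_vars])
toy$sum_small <- rowSums(toy[small_vars])
```

Indexing with toy[big_vars] rather than toy[, big_vars] keeps the result a data frame even when only one variable is selected, so rowSums() still works in that case.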

Upvotes: 1
