Defcon

Reputation: 817

R: adding regression coefficients to a data frame

I have a list of data frames containing many subsets of data (about 470). I am trying to run a regression on each of them and collect the regression coefficients in a single data frame, which should hold the coefficients for all predictor variables for each subgroup. I tried iterating with a for loop, but that does not save anything. I think the solution has something to do with lapply?

for (i in ListOfTraining){
    lm(JOB_VOLUME ~ FEB+MAR+APR+MAY+JUN+JUL+AUG+SEP+OCT+NOV+DEC, data=ListOfTraining[[i]])
}

Thanks for any advice!

Upvotes: 1

Views: 3673

Answers (3)

ebyerly

Reputation: 672

You can solve this with a for loop, if you prefer. The problem is that the results are not saved to an object as the loop progresses, and that the loop should iterate over the list's indices (seq_along) rather than its elements. The example below uses the built-in mtcars data frame.

(This first example is revised based on OP's request for an example of how to also extract the R squared value.)

ListOfTraining <- list(mtcars, mtcars)
results <- list()

for (i in seq_along(ListOfTraining)) {
  lm_obj <- lm(disp ~ qsec, data = ListOfTraining[[i]])
  tmp <- c(lm_obj$coefficients, summary(lm_obj)$r.squared)
  names(tmp)[length(tmp)] <- "r.squared"
  results[[i]] <- tmp
}

results <- do.call(rbind, results)
results

You can also rewrite the for loop using lapply as demoed below.

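ListOfTraining <- list(mtcars, mtcars)

# lapply returns a list with one coefficient vector per data frame
results <- lapply(ListOfTraining, function(x) {
  lm(disp ~ qsec, data = x)$coefficients
})

# stack the coefficient vectors into a matrix, one row per data frame
results <- do.call(rbind, results)
results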
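If the R squared value from the first example is also wanted, the same lapply pattern can return it alongside the coefficients. The following is a minimal sketch, assuming a named list so that the row labels identify each subset. Splitting mtcars by cyl and the helper name fit_one are illustrative choices, not part of the original question.

```r
# Illustrative data: one data frame per cylinder count (list names "4", "6", "8")
ListOfTraining <- split(mtcars, mtcars$cyl)

# fit_one is a hypothetical helper: fit the model, then return the
# coefficients together with the R squared value as one named vector
fit_one <- function(x) {
  lm_obj <- lm(disp ~ qsec, data = x)
  c(coef(lm_obj), r.squared = summary(lm_obj)$r.squared)
}

# rbind uses the list names as row names, so each row is labelled by subset
results <- as.data.frame(do.call(rbind, lapply(ListOfTraining, fit_one)))
results
```

Keeping the fitting logic in a small function makes it easy to add further statistics later without touching the code that combines the results.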

Finally, you can use the plyr package's ldply function, which applies a function to each list element and combines the outputs into a data frame automatically (if possible).

ListOfTraining <- list(mtcars, mtcars)
results <- plyr::ldply(ListOfTraining, function(x) {
  lm(disp ~ qsec, data = x)$coefficients
})
results

Upvotes: 2

Rorschach

Reputation: 32416

The tidy function from the broom package handles this nicely.

library(dplyr)          # bind_rows is more efficient than do.call(rbind, ...)
library(broom)          # put statistics into data.frame
bind_rows(lapply(ListOfTraining, function(dat)
    tidy(lm(JOB_VOLUME ~ FEB+MAR+APR+MAY+JUN+JUL+AUG+SEP+OCT+NOV+DEC, data=dat))))

Example

dataList <- split(mtcars, mtcars$cyl)  # list of data.frames by number of cylinders
lapply(dataList, function(dat) tidy(lm(mpg ~ disp + hp, data=dat))) %>%  # fit models
  bind_rows() %>%                                                        # combine into one data.frame
  mutate(model=rep(1:length(dataList), each=3))                          # add a model ID column
#          term     estimate   std.error   statistic      p.value model
# 1 (Intercept) 43.040057552 4.235724713 10.16120274 7.531962e-06     1
# 2        disp -0.119536016 0.036945788 -3.23544366 1.195900e-02     1
# 3          hp -0.046091563 0.047423668 -0.97191054 3.595602e-01     1
# 4 (Intercept) 20.151209478 6.938235241  2.90437104 4.392508e-02     2
# 5        disp  0.001796527 0.020195109  0.08895852 9.333909e-01     2
# 6          hp -0.006032441 0.034597750 -0.17435935 8.700522e-01     2
# 7 (Intercept) 24.044775630 4.045729006  5.94324919 9.686231e-05     3
# 8        disp -0.018627566 0.009456903 -1.96973225 7.456584e-02     3
# 9          hp -0.011315585 0.012572498 -0.90002676 3.873854e-01     3

Alternatively, you could bind the data.frames beforehand, assuming they have the same columns, and then fit the models using lmList from the nlme package.

## Combine list of data.frames into one data.frame with a factor variable
lengths <- sapply(dataList, nrow)  # in case data.frames have different num. rows
dat <- dataList %>% bind_rows() %>% 
  mutate(group=rep(1:length(dataList), times=lengths))  # group id column

library(nlme)  # lmList()
models <- lmList(mpg ~ disp + hp | group, data=dat)  # make models, grouped by group
coef(models)  # coefficients as a data.frame, one row per group
#   (Intercept)         disp           hp
# 1    43.04006 -0.119536016 -0.046091563
# 2    20.15121  0.001796527 -0.006032441
# 3    24.04478 -0.018627566 -0.011315585

Upvotes: 3

Greg Snow

Reputation: 49640

Your current code runs the regression, but does not do anything with the results (inside of a loop they are not even autoprinted), so they are just discarded. You need to have some structure to save the results into.

The following code will create a matrix of coefficients (assuming that all the regressions run without error and the number of final coefficients is the same):

my.coef <- sapply( ListOfTraining, function(dat) { 
    coef(lm( JOB_VOLUME ~ FEB+MAR+APR+MAY+JUN+JUL+AUG+SEP+OCT+NOV+DEC,
             data=dat) )
})

The matrix can then be converted to a data frame (you could also use lapply and convert to a data frame, but I think the sapply option is probably a little simpler).
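Because sapply returns one column per data frame and one row per coefficient, the matrix usually needs transposing before the conversion. A minimal sketch of that step follows. It uses mtcars split by cyl as stand-in data (an assumption, since JOB_VOLUME and the month columns are not available here), and the names coef.df and subset are illustrative.

```r
# Stand-in data: the original ListOfTraining is not available
ListOfTraining <- split(mtcars, mtcars$cyl)

# sapply gives a matrix with coefficients in rows, one column per data frame
my.coef <- sapply(ListOfTraining, function(dat) {
  coef(lm(mpg ~ disp + hp, data = dat))
})

# Transpose so that each row is one regression, then convert
coef.df <- as.data.frame(t(my.coef))
coef.df$subset <- rownames(coef.df)  # keep the list names as an ordinary column
coef.df
```

If some regressions could drop a coefficient (for example, a month with no observations in one subset), sapply would return a list rather than a matrix, so the caveat about equal numbers of coefficients still applies.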

Upvotes: 1
