Run lm() function over subsets of 2 different variables in data frame

Question

I would like to create a function (preferably with an apply family function, that runs an lm function over subsets of data from two separate variables in a data frame. For reference, my data is currently structured as follows (this is a dummy data frame that is generally representative of the real data frame I'm working with):

District = c(rep("A", times = 25),
             rep("B", times = 25),
             rep("C", times = 25),
             rep("D", times = 25))

Year = c(1971:1975, 1971:1975, 1971:1975, 1971:1975,
         1971:1975, 1971:1975, 1971:1975, 1971:1975,
         1971:1975, 1971:1975, 1971:1975, 1971:1975,
         1971:1975, 1971:1975, 1971:1975, 1971:1975,
         1971:1975, 1971:1975, 1971:1975, 1971:1975)

Crop = c(rep("Wheat",  times = 5),
         rep("Maize",  times = 5),
         rep("Rice",   times = 5),
         rep("Barley", times = 5))

set.seed(100)
Yield = rnorm(100, 2, 0.5)

df <- data.frame(District, Year, Crop, Yield)

I need to generate a linear model (lm) predicting Yield as a function of Year for each Crop on a per-District basis. So, I need a model for Wheat in each district, Barley in each district, and so on. I need to automate this to the extent possible as my real data will result in approximately 7000 linear models.

I've found this answer by James Bond to be very helpful as a point of departure. It is very elegant and uses lapply, which I would like to do as well, if possible (7000 linear models = very slow). However, it is only subsetting a SINGLE column in the dataset, whereas I need to subset two variables for each model. Here is my current (non-working) code modified to run on the above dummy dataset:

df$Dist <- as.factor(df$Dist)
df$Crop <- as.factor(df$Crop)

for (i in 1:length(levels(df$Crop))) {
  x <- levels(df$Crop)[i]
  dat <- df[df$Crop == x, ]
  out <- lapply(levels(dat$Dist), function(z) {
            data.frame(District = z, 
                       Slope = lm(Yield ~ Year, data = dat[dat$Dist == z, ])$coef[2], 
                       Crop = x,
                       row.names=NULL)
  })
}

do.call(rbind ,out)

Unfortunately, running the above code only generates a model for the first Crop level in the dataset (Wheat). See the output below:

  District       Slope  Crop
1        A  0.03125866 Wheat
2        B -0.08108222 Wheat
3        C  0.17172314 Wheat
4        D -0.11278486 Wheat

Any help in being able to loop over both the Crop and the District variable would be much appreciated. It seems like I'm close, but seem to missing something fundamental about the for loop.

If it is possible to pass 2 arguments to the lapply function and avoid the for loop altogether, that would be amazing. Thanks in advance!

joran · Accepted Answer

One option using dplyr:

df_lm <- df %>%
  group_by(District,Crop) %>%
  do(mod = lm(Yield ~ Year,data = .))

df_coef <- df_lm %>%
  do(data.frame(
    District = .$District,
    Crop = .$Crop,
    var = names(coef(.$mod)),
    coef(summary(.$mod)))
    )

> df_coef
Source: local data frame [32 x 7]
Groups: 

# A tibble: 32 × 7
   District   Crop         var      Estimate   Std..Error    t.value   Pr...t..
*                                        
1         A Barley (Intercept) -407.66953514 378.49788671 -1.0770722 0.36034462
2         A Barley        Year    0.20771336   0.19183872  1.0827499 0.35818046
3         A  Maize (Intercept)  159.81133118 212.90233600  0.7506321 0.50738515
4         A  Maize        Year   -0.08002266   0.10790790 -0.7415830 0.51211787
5         A   Rice (Intercept)  -68.01125454 117.60578244 -0.5782986 0.60361684
6         A   Rice        Year    0.03552764   0.05960758  0.5960255 0.59313364
7         A  Wheat (Intercept)  -59.61828825 134.67806297 -0.4426726 0.66972726
8         A  Wheat        Year    0.03125866   0.06826053  0.4579317 0.65918309
9         B Barley (Intercept) -319.99755207  57.14553545 -5.5996947 0.01125215
10        B Barley        Year    0.16332436   0.02896377  5.6389189 0.01103509
# ... with 22 more rows

Another thing to look at is the lmList function in nlme.

Run lm() function over subsets of 2 different variables in data frame

Answers (1)

Related Questions