Reputation: 537
I would like to create a function (preferably with an apply
family function, that runs an lm
function over subsets of data from two separate variables in a data frame. For reference, my data is currently structured as follows (this is a dummy data frame that is generally representative of the real data frame I'm working with):
District = c(rep("A", times = 25),
rep("B", times = 25),
rep("C", times = 25),
rep("D", times = 25))
Year = c(1971:1975, 1971:1975, 1971:1975, 1971:1975,
1971:1975, 1971:1975, 1971:1975, 1971:1975,
1971:1975, 1971:1975, 1971:1975, 1971:1975,
1971:1975, 1971:1975, 1971:1975, 1971:1975,
1971:1975, 1971:1975, 1971:1975, 1971:1975)
Crop = c(rep("Wheat", times = 5),
rep("Maize", times = 5),
rep("Rice", times = 5),
rep("Barley", times = 5))
set.seed(100)
Yield = rnorm(100, 2, 0.5)
df <- data.frame(District, Year, Crop, Yield)
I need to generate a linear model (lm
) predicting Yield
as a function of Year
for each Crop
on a per-District
basis. So, I need a model for Wheat in each district, Barley in each district, and so on. I need to automate this to the extent possible as my real data will result in approximately 7000 linear models.
I've found this answer by James Bond to be very helpful as a point of departure. It is very elegant and uses lapply
, which I would like to do as well, if possible (7000 linear models = very slow). However, it is only subsetting a SINGLE column in the dataset, whereas I need to subset two variables for each model. Here is my current (non-working) code modified to run on the above dummy dataset:
df$Dist <- as.factor(df$Dist)
df$Crop <- as.factor(df$Crop)
for (i in 1:length(levels(df$Crop))) {
x <- levels(df$Crop)[i]
dat <- df[df$Crop == x, ]
out <- lapply(levels(dat$Dist), function(z) {
data.frame(District = z,
Slope = lm(Yield ~ Year, data = dat[dat$Dist == z, ])$coef[2],
Crop = x,
row.names=NULL)
})
}
do.call(rbind ,out)
Unfortunately, running the above code only generates a model for the first Crop
level in the dataset (Wheat). See the output below:
District Slope Crop
1 A 0.03125866 Wheat
2 B -0.08108222 Wheat
3 C 0.17172314 Wheat
4 D -0.11278486 Wheat
Any help in being able to loop over both the Crop
and the District
variable would be much appreciated. It seems like I'm close, but seem to missing something fundamental about the for loop.
If it is possible to pass 2 arguments to the lapply
function and avoid the for
loop altogether, that would be amazing. Thanks in advance!
Upvotes: 1
Views: 2454
Reputation: 173697
One option using dplyr:
df_lm <- df %>%
group_by(District,Crop) %>%
do(mod = lm(Yield ~ Year,data = .))
df_coef <- df_lm %>%
do(data.frame(
District = .$District,
Crop = .$Crop,
var = names(coef(.$mod)),
coef(summary(.$mod)))
)
> df_coef
Source: local data frame [32 x 7]
Groups: <by row>
# A tibble: 32 × 7
District Crop var Estimate Std..Error t.value Pr...t..
* <fctr> <fctr> <fctr> <dbl> <dbl> <dbl> <dbl>
1 A Barley (Intercept) -407.66953514 378.49788671 -1.0770722 0.36034462
2 A Barley Year 0.20771336 0.19183872 1.0827499 0.35818046
3 A Maize (Intercept) 159.81133118 212.90233600 0.7506321 0.50738515
4 A Maize Year -0.08002266 0.10790790 -0.7415830 0.51211787
5 A Rice (Intercept) -68.01125454 117.60578244 -0.5782986 0.60361684
6 A Rice Year 0.03552764 0.05960758 0.5960255 0.59313364
7 A Wheat (Intercept) -59.61828825 134.67806297 -0.4426726 0.66972726
8 A Wheat Year 0.03125866 0.06826053 0.4579317 0.65918309
9 B Barley (Intercept) -319.99755207 57.14553545 -5.5996947 0.01125215
10 B Barley Year 0.16332436 0.02896377 5.6389189 0.01103509
# ... with 22 more rows
Another thing to look at is the lmList
function in nlme.
Upvotes: 2