Reputation: 537
I'm currently working with a large(ish) data frame (13,884 rows and 57 columns) of agricultural data.
One column of the data frame is comprised of names of 'districts' of the country of interest. Also included are 'total area under production' columns for multiple crops, and a 'year' column corresponding to each observation. A simplified version of the data frame:
Dist_names <- c('A', 'B', 'C', 'D')
Rice_area <- rnorm(16, mean = 5, sd = 1)
Random_var <- rep('blah', times = 16)
Maize_area <- rnorm(16, mean = 3, sd = 1)
Random_var1 <- rep('blah', times = 16)
Wheat_area <- rnorm(16, mean = 7, sd = 1)
Year <- c(rep('1966', times = 4), rep('1971', times = 4),
rep('1984', times = 4), rep('1996', times = 4))
df_ag <- data.frame(Dist_names,
Rice_area,
Random_var,
Maize_area,
Random_var1,
Wheat_area,
Year)
df_ag
Dist_names Rice_area Random_var Maize_area Random_var1 Wheat_area Year
1 A 6.266559 blah 3.8740517 blah 7.775330 1966
2 B 5.611816 blah 1.9078029 blah 7.497784 1966
3 C 5.481312 blah 2.2931361 blah 6.556777 1966
4 D 3.982654 blah 2.2146227 blah 6.899663 1966
5 A 6.123487 blah 2.3746220 blah 6.537040 1971
6 B 6.760871 blah 2.6296762 blah 6.994326 1971
7 C 5.123877 blah 3.3364304 blah 7.348202 1971
8 D 5.340764 blah 3.3026722 blah 6.316179 1971
9 A 5.005836 blah 2.6335372 blah 7.031141 1984
10 B 4.224905 blah 4.4294862 blah 7.822868 1984
11 C 5.297800 blah 2.3048798 blah 4.287632 1984
12 D 7.870687 blah 1.5812036 blah 6.171034 1984
13 A 4.575766 blah 0.3331641 blah 6.971024 1996
14 B 5.717461 blah 2.7911101 blah 7.396314 1996
15 C 4.679965 blah 3.0742187 blah 5.575169 1996
16 D 3.892069 blah 2.5029748 blah 7.660881 1996
So, what I'm attempting to do is to loop through the dist_names
variable, create a linear model for each crop_area
variable as a function of the year
variable, and plot the output along with an abline()
. It's necessary to automate this task, as there are 332 unique district names x 28 crops = 9296 plots to be generated.
I'm able to loop through a single crop variable and generate the visuals using code similar to the following:
par(ask=TRUE)
dists <- unique(df_ag$Dist_names)
for (dis in dists) {
dat <- df_ag[df_ag$Dist_names == dis, ]
m <- lm(Rice_area ~ Year, data = dat)
plot(dat$Year, dat$Rice_area, main=paste0(dat$Dist_name[1], ', ', dis))
abline(m)
}
However, I'm having a difficult time generalizing the code to be able to do the same thing as above for ALL of the crop_area
variables. My current thinking is that I need a function comprised of a nested for
loop. Here is my latest (non-working) attempt:
par(ask=TRUE)
graph_fun <- function(df, na.rm = TRUE) {
# find unique districts within dist_names
dists <- unique(df_ag$Dist_names)
# total area variables in data frame
ta_vars <- df_ag[grepl("area", names(df_ag))]
# loop through each district name
for (dis in dists) {
# loop through each crop variable
for (i in 1:ncol(ta_vars)) {
# new variable with each district and each crop
dat <- df_ag[df_ag$Dist_names == dis, ta_vars[i]]
}
# generate linear models and plots
m <- lm(dat[j], Year, data = dat)
plot(dat$Year, dat[j], main=paste0(dat$Dist_names[1], ', ', dis,))
abline(m)
}
}
Needless to say, the foregoing code does not do the trick. I'm currently getting the following error, although I'm sure there are multiple areas where the code is wrong:
Error in x[j] : invalid subscript type 'list'
Any guidance would be greatly appreciated. I'm not married to the for
loop concept if anyone can conceive of a way to accomplish the task with an apply
family function.
Upvotes: 1
Views: 1177
Reputation: 70653
If I were be doing this, I would melt the data into a long format and use one of the handy functions which slice and dice the data. Something along the lines of
library(tidyr)
xy <- gather(df_ag, key = crop, value = area, -Random_var, -Random_var1, -Year, -Dist_names)
xy$Year <- as.numeric(as.character(xy$Year))
by(data = xy, INDICES = list(xy$Dist_names, xy$crop), FUN = function(x) {
mdl <- lm(area ~ Year, data = x)
plot(area ~ Year, data = x, type = "p")
abline(mdl)
return(mdl)
})
... but you probably need a mixed effects model anyway.
And this produces the plot you're after.
Dist_names <- c('A', 'B', 'C', 'D')
Rice_area <- rnorm(16, mean = 5, sd = 1)
Random_var <- rep('blah', times = 16)
Maize_area <- rnorm(16, mean = 3, sd = 1)
Random_var1 <- rep('blah', times = 16)
Wheat_area <- rnorm(16, mean = 7, sd = 1)
Year <- as.numeric(c(rep('1966', times = 4), rep('1971', times = 4),
rep('1984', times = 4), rep('1996', times = 4)))
df_ag <- data.frame(Dist_names,
Rice_area,
Random_var,
Maize_area,
Random_var1,
Wheat_area,
Year)
graph_fun <- function(df) {
# find unique districts within dist_names
dists <- unique(df$Dist_names)
# total area variables in data frame
ta_vars <- df[grepl("area", names(df))]
# browser() # if you enable this, it will, upon execution, stop the function here
# you can then look around the function and run whichever bit of code you wish to poke around
par(mfrow = c(length(dists), ncol(ta_vars)))
# loop through each district name
for (dis in dists) {
# loop through each crop variable
for (ta_var in colnames(ta_vars)) {
# new variable with each district and each crop
dat <- df[df$Dist_names == dis, ]
# generate linear models and plots
m <- lm(formula(paste(ta_var, "~ Year"), data = dat))
plot(dat$Year, dat[, ta_var], main=paste0(unique(dat$Dist_names), ', ', ta_var))
abline(m)
}
}
}
graph_fun(df_ag)
This may not work for many levels so you will have to tweak the part to only make n amount of levels at a time.
Upvotes: 1