JmeCS
JmeCS

Reputation: 537

Create graphing function that loops over: 1) unique rows in one column AND 2) all values in certain other columns

I'm currently working with a large(ish) data frame (13,884 rows and 57 columns) of agricultural data.

One column of the data frame is comprised of names of 'districts' of the country of interest. Also included are 'total area under production' columns for multiple crops, and a 'year' column corresponding to each observation. A simplified version of the data frame:

Dist_names  <- c('A', 'B', 'C', 'D')
Rice_area   <- rnorm(16, mean = 5, sd = 1)
Random_var  <- rep('blah', times = 16)
Maize_area  <- rnorm(16, mean = 3, sd = 1)
Random_var1 <- rep('blah', times = 16)
Wheat_area  <- rnorm(16, mean = 7, sd = 1)
Year        <- c(rep('1966', times = 4), rep('1971', times = 4), 
                 rep('1984', times = 4), rep('1996', times = 4))

df_ag <- data.frame(Dist_names, 
                    Rice_area,
                    Random_var,
                    Maize_area,
                    Random_var1,
                    Wheat_area,
                    Year) 
df_ag

   Dist_names Rice_area Random_var Maize_area Random_var1 Wheat_area Year
1           A  6.266559       blah  3.8740517        blah   7.775330 1966
2           B  5.611816       blah  1.9078029        blah   7.497784 1966
3           C  5.481312       blah  2.2931361        blah   6.556777 1966
4           D  3.982654       blah  2.2146227        blah   6.899663 1966
5           A  6.123487       blah  2.3746220        blah   6.537040 1971
6           B  6.760871       blah  2.6296762        blah   6.994326 1971
7           C  5.123877       blah  3.3364304        blah   7.348202 1971
8           D  5.340764       blah  3.3026722        blah   6.316179 1971
9           A  5.005836       blah  2.6335372        blah   7.031141 1984
10          B  4.224905       blah  4.4294862        blah   7.822868 1984
11          C  5.297800       blah  2.3048798        blah   4.287632 1984
12          D  7.870687       blah  1.5812036        blah   6.171034 1984
13          A  4.575766       blah  0.3331641        blah   6.971024 1996
14          B  5.717461       blah  2.7911101        blah   7.396314 1996
15          C  4.679965       blah  3.0742187        blah   5.575169 1996
16          D  3.892069       blah  2.5029748        blah   7.660881 1996

So, what I'm attempting to do is to loop through the dist_names variable, create a linear model for each crop_area variable as a function of the year variable, and plot the output along with an abline(). It's necessary to automate this task, as there are 332 unique district names x 28 crops = 9296 plots to be generated.

I'm able to loop through a single crop variable and generate the visuals using code similar to the following:

par(ask=TRUE)
dists <- unique(df_ag$Dist_names)
for (dis in dists) {
  dat <- df_ag[df_ag$Dist_names == dis, ]
  m <- lm(Rice_area ~ Year, data = dat)
  plot(dat$Year, dat$Rice_area, main=paste0(dat$Dist_name[1], ', ', dis))
  abline(m)
}

However, I'm having a difficult time generalizing the code to be able to do the same thing as above for ALL of the crop_area variables. My current thinking is that I need a function comprised of a nested for loop. Here is my latest (non-working) attempt:

par(ask=TRUE)
graph_fun <- function(df, na.rm = TRUE) {

  # find unique districts within dist_names
  dists <- unique(df_ag$Dist_names)

  # total area variables in data frame
  ta_vars <- df_ag[grepl("area", names(df_ag))]

  # loop through each district name
  for (dis in dists) {

    # loop through each crop variable
    for (i in 1:ncol(ta_vars)) {

      # new variable with each district and each crop        
      dat <- df_ag[df_ag$Dist_names == dis, ta_vars[i]]

    }

    # generate linear models and plots
    m <- lm(dat[j], Year, data = dat)
    plot(dat$Year, dat[j], main=paste0(dat$Dist_names[1], ', ', dis,))
    abline(m)

  }

} 

Needless to say, the foregoing code does not do the trick. I'm currently getting the following error, although I'm sure there are multiple areas where the code is wrong:

Error in x[j] : invalid subscript type 'list'

Any guidance would be greatly appreciated. I'm not married to the for loop concept if anyone can conceive of a way to accomplish the task with an apply family function.

Upvotes: 1

Views: 1177

Answers (1)

Roman Luštrik
Roman Luštrik

Reputation: 70653

If I were be doing this, I would melt the data into a long format and use one of the handy functions which slice and dice the data. Something along the lines of

library(tidyr)
xy <- gather(df_ag, key = crop, value = area, -Random_var, -Random_var1, -Year, -Dist_names)
xy$Year <- as.numeric(as.character(xy$Year))


by(data = xy, INDICES = list(xy$Dist_names, xy$crop), FUN = function(x) {
  mdl <- lm(area ~ Year, data = x)
  plot(area ~ Year, data = x, type = "p")
  abline(mdl)

  return(mdl)
  })

... but you probably need a mixed effects model anyway.

And this produces the plot you're after.

Dist_names  <- c('A', 'B', 'C', 'D')
Rice_area   <- rnorm(16, mean = 5, sd = 1)
Random_var  <- rep('blah', times = 16)
Maize_area  <- rnorm(16, mean = 3, sd = 1)
Random_var1 <- rep('blah', times = 16)
Wheat_area  <- rnorm(16, mean = 7, sd = 1)
Year        <- as.numeric(c(rep('1966', times = 4), rep('1971', times = 4), 
                 rep('1984', times = 4), rep('1996', times = 4)))

df_ag <- data.frame(Dist_names, 
                    Rice_area,
                    Random_var,
                    Maize_area,
                    Random_var1,
                    Wheat_area,
                    Year)

graph_fun <- function(df) {
  # find unique districts within dist_names
  dists <- unique(df$Dist_names)

  # total area variables in data frame
  ta_vars <- df[grepl("area", names(df))]

# browser() # if you enable this, it will, upon execution, stop the function here
# you can then look around the function and run whichever bit of code you wish to poke around
  par(mfrow = c(length(dists), ncol(ta_vars)))

  # loop through each district name
  for (dis in dists) {

    # loop through each crop variable
    for (ta_var in colnames(ta_vars)) {

      # new variable with each district and each crop        
      dat <- df[df$Dist_names == dis, ]

      # generate linear models and plots
      m <- lm(formula(paste(ta_var, "~ Year"), data = dat))
      plot(dat$Year, dat[, ta_var], main=paste0(unique(dat$Dist_names), ', ', ta_var))
      abline(m)
    }

  }
}

graph_fun(df_ag)

This may not work for many levels so you will have to tweak the part to only make n amount of levels at a time.

Upvotes: 1

Related Questions