Efficient method to group and then apply function across subsets of rows within group in R

Question

I'm looking for a general approach to applying an arbitrary function across arbitrary subsets of rows within arbitrary groups of data in R.

Here is an example of the function that I have written that groups data then applies an arbitrary function such as sum or mean across a subset (window) of rows in a given a group.

#' @param DATA data frame. This is the data set containing the data to be grouped, 
#' ordered, and used in the calculation.
#' @param GROUP_BY character vector. This is a vector of the columns of the data 
#' frame that are to be used to group different observations which are then 
#' FUNCTIONed across. 
#' @param ORDER_BY character. This is the name of the column of DATA that is 
#' to be used for determining the WINDOW.
#' @param CALC_OVER character. The name of the column over which the calculation is 
#' to be performed in accordance with FUNCTION.
#' @param WINDOW integer. Positive integer will sum the WINDOW values of CALC_OVER 
#' on and after each ORDER_BY. Negative integer will sum the WINDOW values of CALC_OVER 
#' values on and before each ORDER_BY.
#' @param FUNCTION character. Name of function to be used on values defined by 
#' CALC_OVER over the WINDOW.
#' FUNCTION applied to the CALC_OVER data.
#' @export

ApplyFunctionWindow <- function(DATA, GROUP_BY, ORDER_BY,  CALC_OVER, 
                                WINDOW = -4L, FUNCTION) {             
  # dplyr's arrange, order_by, and mutate would probably be faster but are a pain 
    # to implement with dynamic variables
  if (length(GROUP_BY) > 1) {
    grouped_data <- split(x = DATA, f = as.list(DATA[, GROUP_BY]), drop = TRUE)
  } else {
    grouped_data <- split(x = DATA, f = DATA[, GROUP_BY], drop = TRUE)
  }

  calculations <- dplyr::data_frame()

  for (g in 1:length(grouped_data)) {
    grouped_data_frame <- grouped_data[[g]]

    for (r in 1:nrow(grouped_data_frame)) {
      grouped_data_frame <- grouped_data_frame[ 
        order(grouped_data_frame[, ORDER_BY]), 
        ]

      if( WINDOW < 0) {
        if( (r + 1 + WINDOW) < 1L | (r + 1 + WINDOW) > nrow(grouped_data_frame)) {
          grouped_data_frame[r, paste(CALC_OVER, FUNCTION, WINDOW, sep = "_")] <- NA
        } else {
          grouped_data_frame[r, paste(CALC_OVER, FUNCTION, WINDOW, sep = "_")] <- 
            do.call(what = FUNCTION, 
                    args = list(grouped_data_frame[r:(r + 1 + WINDOW), CALC_OVER]))
        }
      } else {
        if((r - 1 + WINDOW) > nrow(grouped_data_frame)) {
          grouped_data_frame[r, paste(CALC_OVER, FUNCTION, WINDOW, sep = "_")] <- NA
        } else {
          grouped_data_frame[r, paste(CALC_OVER, FUNCTION, WINDOW, sep = "_")] <- 
            do.call(what = FUNCTION, 
                    args = list(grouped_data_frame[r:(r - 1 + WINDOW), CALC_OVER]))
        }
      }
    }

      calculations <- dplyr::bind_rows(calculations, grouped_data_frame)
    } else {
      calculations <- dplyr::bind_rows(calculations, grouped_data_frame)
    }

  }

  calculations
}

The following is an example data set along with the output from my function. It works as expected and is quick on small data sets. However, I often have data sets of millions of rows with 10-20 thousand different groups.

example_data <- data.frame(id_1 = c(rep("jane", 8), rep("joe", 12), rep("jack", 16)), 
                  id_2 = c(rep("doe", 8), rep("doe", 12), rep("smith", 16)),
                  year = c(rep(2010, 4), rep(2011, 4), 
                           rep(2008, 4), rep(2009, 4), rep(2010, 4), 
                           rep(2005, 4), rep(2006, 4), rep(2007, 4), rep(2008, 4)), 
                  quarter = rep(seq(1:4), 9),
           data_value = rnorm(36, 10, 1),
           stringsAsFactors = FALSE
                  )

example_data[, "year_quarter"] <- paste(example_data[, "year"], 
                                        "_",
                                        example_data[, "quarter"])

trailing_four_quarters <- ApplyFunctionWindow(DATA = example_data, 
                                              GROUP_BY = c("id_1", "id_2"), 
                                              ORDER_BY = "year_quarter", 
                                              CALC_OVER = "data_value", 
                                              WINDOW = -4L, 
                                              FUNCTION = "sum", 
                                              OMIT_NA = FALSE)

Chris · Accepted Answer

Apply is not the only way to get things done in r - this is far better done with data.table

library(data.table)
setDT(example_data)

cols <- c("data_value")
cols_L4Q <- paste0(cols,"_L4Q")

example_data <- example_data[order(id_1,id_2,year,quarter)]

example_data[, (cols_L4Q) := lapply(.SD, function(x) { Reduce(`+`, shift(x, 0L:(4 - 1L), type = "lag")) }), .SDcols = cols, by = .(id_1,id_2)]

This works on multiple columns, just build cols accordingly.

`+` can be any function that aggregates a vector (including mean, sum etc). If you don't need the trailing behavior, you can remove the shift function.

Efficient method to group and then apply function across subsets of rows within group in R

Answers (2)

Related Questions