Ursus Frost
Ursus Frost

Reputation: 405

Efficient method to group and then apply function across subsets of rows within group in R

I'm looking for a general approach to applying an arbitrary function across arbitrary subsets of rows within arbitrary groups of data in R.

Here is an example of the function that I have written that groups data then applies an arbitrary function such as sum or mean across a subset (window) of rows in a given a group.

#' @param DATA data frame. This is the data set containing the data to be grouped, 
#' ordered, and used in the calculation.
#' @param GROUP_BY character vector. This is a vector of the columns of the data 
#' frame that are to be used to group different observations which are then 
#' FUNCTIONed across. 
#' @param ORDER_BY character. This is the name of the column of DATA that is 
#' to be used for determining the WINDOW.
#' @param CALC_OVER character. The name of the column over which the calculation is 
#' to be performed in accordance with FUNCTION.
#' @param WINDOW integer. Positive integer will sum the WINDOW values of CALC_OVER 
#' on and after each ORDER_BY. Negative integer will sum the WINDOW values of CALC_OVER 
#' values on and before each ORDER_BY.
#' @param FUNCTION character. Name of function to be used on values defined by 
#' CALC_OVER over the WINDOW.
#' FUNCTION applied to the CALC_OVER data.
#' @export

ApplyFunctionWindow <- function(DATA, GROUP_BY, ORDER_BY,  CALC_OVER, 
                                WINDOW = -4L, FUNCTION) {             
  # dplyr's arrange, order_by, and mutate would probably be faster but are a pain 
    # to implement with dynamic variables
  if (length(GROUP_BY) > 1) {
    grouped_data <- split(x = DATA, f = as.list(DATA[, GROUP_BY]), drop = TRUE)
  } else {
    grouped_data <- split(x = DATA, f = DATA[, GROUP_BY], drop = TRUE)
  }

  calculations <- dplyr::data_frame()

  for (g in 1:length(grouped_data)) {
    grouped_data_frame <- grouped_data[[g]]

    for (r in 1:nrow(grouped_data_frame)) {
      grouped_data_frame <- grouped_data_frame[ 
        order(grouped_data_frame[, ORDER_BY]), 
        ]

      if( WINDOW < 0) {
        if( (r + 1 + WINDOW) < 1L | (r + 1 + WINDOW) > nrow(grouped_data_frame)) {
          grouped_data_frame[r, paste(CALC_OVER, FUNCTION, WINDOW, sep = "_")] <- NA
        } else {
          grouped_data_frame[r, paste(CALC_OVER, FUNCTION, WINDOW, sep = "_")] <- 
            do.call(what = FUNCTION, 
                    args = list(grouped_data_frame[r:(r + 1 + WINDOW), CALC_OVER]))
        }
      } else {
        if((r - 1 + WINDOW) > nrow(grouped_data_frame)) {
          grouped_data_frame[r, paste(CALC_OVER, FUNCTION, WINDOW, sep = "_")] <- NA
        } else {
          grouped_data_frame[r, paste(CALC_OVER, FUNCTION, WINDOW, sep = "_")] <- 
            do.call(what = FUNCTION, 
                    args = list(grouped_data_frame[r:(r - 1 + WINDOW), CALC_OVER]))
        }
      }
    }

      calculations <- dplyr::bind_rows(calculations, grouped_data_frame)
    } else {
      calculations <- dplyr::bind_rows(calculations, grouped_data_frame)
    }

  }

  calculations
}  

The following is an example data set along with the output from my function. It works as expected and is quick on small data sets. However, I often have data sets of millions of rows with 10-20 thousand different groups.

example_data <- data.frame(id_1 = c(rep("jane", 8), rep("joe", 12), rep("jack", 16)), 
                  id_2 = c(rep("doe", 8), rep("doe", 12), rep("smith", 16)),
                  year = c(rep(2010, 4), rep(2011, 4), 
                           rep(2008, 4), rep(2009, 4), rep(2010, 4), 
                           rep(2005, 4), rep(2006, 4), rep(2007, 4), rep(2008, 4)), 
                  quarter = rep(seq(1:4), 9),
           data_value = rnorm(36, 10, 1),
           stringsAsFactors = FALSE
                  )

example_data[, "year_quarter"] <- paste(example_data[, "year"], 
                                        "_",
                                        example_data[, "quarter"])

trailing_four_quarters <- ApplyFunctionWindow(DATA = example_data, 
                                              GROUP_BY = c("id_1", "id_2"), 
                                              ORDER_BY = "year_quarter", 
                                              CALC_OVER = "data_value", 
                                              WINDOW = -4L, 
                                              FUNCTION = "sum", 
                                              OMIT_NA = FALSE)

Upvotes: 0

Views: 142

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 269734

I haven't examined all your code but producing a rolling sum by group can be compactly implemented like this. We first define the function we want to apply and then use ave to run it by group:

library(zoo)

roll <- function(x) if (length(x) >= 4) rollsumr(x, 4, fill = NA) else NA
transform(example_data, four_quarters = ave(data_value, id_1, id_2, FUN = roll))

Upvotes: 2

Chris
Chris

Reputation: 6372

Apply is not the only way to get things done in r - this is far better done with data.table

library(data.table)
setDT(example_data)

cols <- c("data_value")
cols_L4Q <- paste0(cols,"_L4Q")

example_data <- example_data[order(id_1,id_2,year,quarter)]

example_data[, (cols_L4Q) := lapply(.SD, function(x) { Reduce(`+`, shift(x, 0L:(4 - 1L), type = "lag")) }), .SDcols = cols, by = .(id_1,id_2)]

This works on multiple columns, just build cols accordingly.

`+` can be any function that aggregates a vector (including mean, sum etc). If you don't need the trailing behavior, you can remove the shift function.

Upvotes: 3

Related Questions