Reputation: 597
The code below does exactly what I want. It takes a data frame with multiple grouping constraints (i.e. id and date) and returns one row per group, with different functions applied to different columns: in this case, collapsing values to a character string and taking the per-group maximum of a column.
The issue is scaling this approach to much bigger data frames with 100k rows and many groups. How can I get the same result more efficiently? A tidyverse solution is preferred (data.table solutions, while not preferred, will also be accepted). I want to keep the flexibility to specify which columns go into each function call, as well as to add further summary calls, e.g. a summarise_at() taking the sum of a column (a sketch of what this might look like follows the example output below).
library(tidyverse)
library(lubridate)

data <- tibble(
  id = c(1, 1, 1, 2, 3, 4, 5, 5, 6, 6, 6),
  date = dmy("01/01/2020"),
  var1 = 1:11,
  var2 = 12:22,
  var3 = 1:11
)

data %>%
  group_by(id, date) %>%
  {data.frame(
    summarise_at(., vars(var1, var2), list(~ paste(unique(.), collapse = " AND "))), # return character string
    summarise_at(., vars(var3), list(~ max(., na.rm = TRUE))), # return max in group
    summarise(., count = n(), .groups = "keep") # return count of cases in group
  )} %>%
  select(-matches("[.]1$|[.]2$|[.]3$")) %>% # remove duplicated grouping columns
  as_tibble()
# A tibble: 6 x 6
     id date       var1             var2              var3 count
  <dbl> <date>     <chr>            <chr>            <int> <int>
1     1 2020-01-01 1 AND 2 AND 3    12 AND 13 AND 14     3     3
2     2 2020-01-01 4                15                   4     1
3     3 2020-01-01 5                16                   5     1
4     4 2020-01-01 6                17                   6     1
5     5 2020-01-01 7 AND 8          18 AND 19            8     2
6     6 2020-01-01 9 AND 10 AND 11  20 AND 21 AND 22    11     3
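For concreteness, here is a minimal sketch of the flexibility I mean: an extra per-group sum slotted in alongside the existing calls. The additional summarise_at() line and its var1_sum output column are illustrative, not part of the working code above:

data %>%
  group_by(id, date) %>%
  {data.frame(
    summarise_at(., vars(var1, var2), list(~ paste(unique(.), collapse = " AND "))),
    summarise_at(., vars(var3), list(~ max(., na.rm = TRUE))),
    summarise_at(., vars(var1), list(var1_sum = ~ sum(., na.rm = TRUE))), # hypothetical extra summary
    summarise(., count = n(), .groups = "keep")
  )} %>%
  select(-matches("[.]1$|[.]2$|[.]3$")) %>% # now removes three sets of duplicated grouping columns
  as_tibble()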
This related question may also be useful: How can I use summarise_at to apply different functions to different columns?
Upvotes: 1
Views: 311
Reputation: 8844
Here is a tidyverse approach adapted from @MichaelDewar's answer. It is tidier, but I don't think there is any real improvement in efficiency. Besides, a data frame with 100k rows is not that big a deal in my opinion; I think a tidyverse solution is just fine.
library(dplyr)

data %>%
  group_by(id, date) %>%
  summarise(
    across(c(var1, var2), ~ paste(unique(.), collapse = " AND ")),
    across(var3, ~ max(., na.rm = TRUE)),
    count = n(),
    .groups = "keep"
  )
But if you really want to improve the efficiency, perhaps try this data.table solution:
library(data.table)

setDT(data)[, c(
  lapply(c(var1 = "var1", var2 = "var2"), function(x) paste(unique(.SD[[x]]), collapse = " AND ")),
  list(var3 = max(var3, na.rm = TRUE), count = .N)
), by = c("id", "date")]
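The same j expression could also be written with .SDcols, which keeps the collapsed columns in one place if they ever change. An equivalent sketch, not from the original answer:

library(data.table)

setDT(data)[, c(
  lapply(.SD, function(x) paste(unique(x), collapse = " AND ")), # collapse each .SDcols column
  list(var3 = max(var3, na.rm = TRUE), count = .N)
), by = c("id", "date"), .SDcols = c("var1", "var2")]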
Benchmark
Note that setDT() converts its input to a data.table by reference, which is why f3 is benchmarked on a separate copy (data22) below.
set.seed(2020)
data2 <- data[sample.int(nrow(data), 1e5, replace = TRUE), ]
data22 <- data.table::copy(data2)
f1 <-
  . %>%
  group_by(id, date) %>%
  {data.frame(
    summarise_at(., vars(var1, var2), list(~ paste(unique(.), collapse = " AND "))), # return character string
    summarise_at(., vars(var3), list(~ max(., na.rm = TRUE))), # return max in group
    summarise(., count = n(), .groups = "keep") # return count of cases in group
  )} %>%
  select(-matches("[.]1$|[.]2$|[.]3$")) %>% # remove duplicated grouping columns
  as_tibble()
f2 <-
  . %>%
  group_by(id, date) %>%
  summarise(
    across(c(var1, var2), ~ paste(unique(.), collapse = " AND ")),
    across(var3, ~ max(., na.rm = TRUE)),
    count = n(),
    .groups = "keep"
  )
f3 <- function(dt) {
  setDT(dt)[, c(
    lapply(c(var1 = "var1", var2 = "var2"), function(x) paste(unique(.SD[[x]]), collapse = " AND ")),
    list(var3 = max(var3, na.rm = TRUE), count = .N)
  ), by = c("id", "date")]
}
microbenchmark::microbenchmark(f1(data2), f2(data2), f3(data22))
Result
Unit: milliseconds
       expr     min       lq      mean   median       uq     max neval cld
  f1(data2) 19.6730 20.27990 20.841344 20.50850 20.85045 29.2799   100   c
  f2(data2) 13.5455 14.09240 14.705967 14.34585 14.64625 20.5914   100  b
 f3(data22)  6.9186  7.80615  8.598227  8.32035  8.68040 15.8358   100 a
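If you want to keep tidyverse syntax while pushing the work down to data.table, the dtplyr package may also be worth a try. A minimal sketch, not benchmarked here; it assumes dtplyr is installed and that its translation covers these verbs:

library(dtplyr)
library(dplyr)

data %>%
  lazy_dt() %>%                # build a lazy data.table query
  group_by(id, date) %>%
  summarise(
    across(c(var1, var2), ~ paste(unique(.x), collapse = " AND ")),
    var3 = max(var3, na.rm = TRUE),
    count = n()
  ) %>%
  as_tibble()                  # execute the query and collect the result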
Upvotes: 4
Reputation: 3258
Use summarize(across()):
paste_distinct <- function(x) { # "x" rather than "list", to avoid shadowing base::list
  x %>% unique() %>% sort() %>% paste(collapse = " AND ")
}
data %>%
  group_by(id, date) %>%
  summarize(
    across(c(var1, var2), paste_distinct),
    across(var3, ~ max(., na.rm = TRUE)),
    count = n()
  )
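For reference, the helper on its own (a quick illustrative call, not part of the original answer):

paste_distinct(c(3, 1, 3, 2))
#> [1] "1 AND 2 AND 3"

Note the sort(): the collapsed values come out in sorted order, which happens to match the question's expected output here since the input columns are already ordered.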
Upvotes: 2