spillway18
spillway18

Reputation: 221

How do I build a dplyr summarize statement programmatically?

I'm trying to do some dplyr programming and having trouble. I'd like to group_by an arbitrary number of variables (thus, across), and then summarize based on arbitrary length (but all the same length) vectors of:

So, like in a map or apply statement, I want to execute code that ends up looking like:

data %>%
  group_by(group_column) %>%
  summarize(new_name_1 = function_1(column_1),
  summarize(new_name_2 = function_2(column_2))

Here's an example of what I want and my best shot so far. I know I can use the names argument to clean those up if I use across, but I'm not confident that across is the correct way. Finally, I'll be applying this to fairly large dataframes, so I'd rather not calculate the extra columns.

Desired result

mtcars %>%
  group_by(across(c("cyl", "carb"))) %>%
  summarise(across(c("disp", "hp"), list(mean = mean, sd = sd))) %>%
  select(cyl, carb, disp_mean, hp_sd)
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 9 x 4
#> # Groups:   cyl [3]
#>     cyl  carb disp_mean hp_sd
#>   <dbl> <dbl>     <dbl> <dbl>
#> 1     4     1      91.4 16.1 
#> 2     4     2     117.  24.9 
#> 3     6     1     242.   3.54
#> 4     6     4     164.   7.51
#> 5     6     6     145   NA   
#> 6     8     2     346.  14.4 
#> 7     8     3     276.   0   
#> 8     8     4     406.  21.7 
#> 9     8     8     301   NA

What I get

mtcars %>%
  group_by(across(c("cyl", "carb"))) %>%
  summarise(across(c("disp", "hp"), list(mean = mean, sd = sd)))
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 9 x 6
#> # Groups:   cyl [3]
#>     cyl  carb disp_mean disp_sd hp_mean hp_sd
#>   <dbl> <dbl>     <dbl>   <dbl>   <dbl> <dbl>
#> 1     4     1      91.4   21.4     77.4 16.1 
#> 2     4     2     117.    27.1     87   24.9 
#> 3     6     1     242.    23.3    108.   3.54
#> 4     6     4     164.     4.39   116.   7.51
#> 5     6     6     145     NA      175   NA   
#> 6     8     2     346.    43.4    162.  14.4 
#> 7     8     3     276.     0      180    0   
#> 8     8     4     406.    57.8    234   21.7 
#> 9     8     8     301     NA      335   NA

Upvotes: 7

Views: 323

Answers (2)

TimTeaFan
TimTeaFan

Reputation: 18581

I have a package on github {dplyover}

which can help with this kind of tasks. In this case we could use over2 to loop over two character vectors simultaniously. The first vector contains the variable names as string, which is why we have to wrap .x in sym() when applying a function to it. The second vector contains the function names, which we use as .y in a do.call. over2 creates the desired names automatically.

library(dplyr)
library(dplyover) # https://github.com/TimTeaFan/dplyover

mtcars %>%
  group_by(across(c("cyl", "carb"))) %>%
  summarise(over2(c("disp", "hp"),
                  c("mean", "sd"),
                  ~ do.call(.y, list(sym(.x)))
                  ))

#> `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
#> # A tibble: 9 x 4
#> # Groups:   cyl [3]
#>     cyl  carb disp_mean hp_sd
#>   <dbl> <dbl>     <dbl> <dbl>
#> 1     4     1      91.4 16.1 
#> 2     4     2     117.  24.9 
#> 3     6     1     242.   3.54
#> 4     6     4     164.   7.51
#> 5     6     6     145   NA   
#> 6     8     2     346.  14.4 
#> 7     8     3     276.   0   
#> 8     8     4     406.  21.7 
#> 9     8     8     301   NA

An alternative way building on the same logic is to use purrr::map2. However, here we have to put some effort into creating vectors with the desired names.

library(purrr)

# setup vectors and names
myfuns <- c("mean", "sd")
myvars <- c("disp", "hp") %>%
  set_names(., paste(., myfuns, sep = "_"))

mtcars %>%
  group_by(across(c("cyl", "carb"))) %>%
  summarise(map2(myvars,
                 myfuns,
                 ~ do.call(.y, list(sym(.x)))
                 ) %>% bind_cols()
  )

#> `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
#> # A tibble: 9 x 4
#> # Groups:   cyl [3]
#>     cyl  carb disp_mean hp_sd
#>   <dbl> <dbl>     <dbl> <dbl>
#> 1     4     1      91.4 16.1 
#> 2     4     2     117.  24.9 
#> 3     6     1     242.   3.54
#> 4     6     4     164.   7.51
#> 5     6     6     145   NA   
#> 6     8     2     346.  14.4 
#> 7     8     3     276.   0   
#> 8     8     4     406.  21.7 
#> 9     8     8     301   NA

Created on 2021-08-20 by the reprex package (v2.0.1)

Upvotes: 1

akrun
akrun

Reputation: 887851

With different functions on different columns, an option is to use collap from collapse

library(collapse)
collap(mtcars, ~ cyl + carb, custom = list(fmean = 4, fsd = 5))

-output

cyl   disp        hp carb
1   4  91.38 16.133815    1
2   4 116.60 24.859606    2
3   6 241.50  3.535534    1
4   6 163.80  7.505553    4
5   6 145.00        NA    6
6   8 345.50 14.433757    2
7   8 275.80  0.000000    3
8   8 405.50 21.725561    4
9   8 301.00        NA    8

Or the index can be dynamically generated with match

collap(mtcars, ~ cyl + carb, custom = list(fmean =
   match('disp', names(mtcars)), fsd = match('hp', names(mtcars))))

With tidyverse, an option is to loop over the column names of interest and the functions in map2 and do a join later

library(dplyr)
library(purrr)
library(stringr)
map2(c("disp", "hp"), c("mean", "sd"), ~
   mtcars %>%
      group_by(across(c('cyl', 'carb'))) %>% 
      summarise(across(all_of(.x), match.fun(.y), 
         .names = str_c("{.col}_", .y)), .groups = 'drop')) %>% 
    reduce(inner_join)

-output

# A tibble: 9 x 4
    cyl  carb disp_mean hp_sd
  <dbl> <dbl>     <dbl> <dbl>
1     4     1      91.4 16.1 
2     4     2     117.  24.9 
3     6     1     242.   3.54
4     6     4     164.   7.51
5     6     6     145   NA   
6     8     2     346.  14.4 
7     8     3     276.   0   
8     8     4     406.  21.7 
9     8     8     301   NA   

Upvotes: 6

Related Questions