Reputation: 221
I'm trying to do some dplyr programming and having trouble. I'd like to group_by
an arbitrary number of variables (thus, across
), and then summarize
based on arbitrary length (but all the same length) vectors of:
So, like in a map
or apply
statement, I want to execute code that ends up looking like:
data %>%
group_by(group_column) %>%
summarize(new_name_1 = function_1(column_1),
summarize(new_name_2 = function_2(column_2))
Here's an example of what I want and my best shot so far. I know I can use the names
argument to clean those up if I use across, but I'm not confident that across is the correct way. Finally, I'll be applying this to fairly large dataframes, so I'd rather not calculate the extra columns.
Desired result
mtcars %>%
group_by(across(c("cyl", "carb"))) %>%
summarise(across(c("disp", "hp"), list(mean = mean, sd = sd))) %>%
select(cyl, carb, disp_mean, hp_sd)
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 9 x 4
#> # Groups: cyl [3]
#> cyl carb disp_mean hp_sd
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 1 91.4 16.1
#> 2 4 2 117. 24.9
#> 3 6 1 242. 3.54
#> 4 6 4 164. 7.51
#> 5 6 6 145 NA
#> 6 8 2 346. 14.4
#> 7 8 3 276. 0
#> 8 8 4 406. 21.7
#> 9 8 8 301 NA
What I get
mtcars %>%
group_by(across(c("cyl", "carb"))) %>%
summarise(across(c("disp", "hp"), list(mean = mean, sd = sd)))
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 9 x 6
#> # Groups: cyl [3]
#> cyl carb disp_mean disp_sd hp_mean hp_sd
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 1 91.4 21.4 77.4 16.1
#> 2 4 2 117. 27.1 87 24.9
#> 3 6 1 242. 23.3 108. 3.54
#> 4 6 4 164. 4.39 116. 7.51
#> 5 6 6 145 NA 175 NA
#> 6 8 2 346. 43.4 162. 14.4
#> 7 8 3 276. 0 180 0
#> 8 8 4 406. 57.8 234 21.7
#> 9 8 8 301 NA 335 NA
Upvotes: 7
Views: 323
Reputation: 18581
I have a package on github {dplyover}
which can help with this kind of tasks. In this case we could use over2
to
loop over two character vectors simultaniously. The first vector contains the
variable names as string, which is why we have to wrap .x
in sym()
when
applying a function to it. The second vector contains the function names,
which we use as .y
in a do.call
. over2
creates the desired names automatically.
library(dplyr)
library(dplyover) # https://github.com/TimTeaFan/dplyover
mtcars %>%
group_by(across(c("cyl", "carb"))) %>%
summarise(over2(c("disp", "hp"),
c("mean", "sd"),
~ do.call(.y, list(sym(.x)))
))
#> `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
#> # A tibble: 9 x 4
#> # Groups: cyl [3]
#> cyl carb disp_mean hp_sd
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 1 91.4 16.1
#> 2 4 2 117. 24.9
#> 3 6 1 242. 3.54
#> 4 6 4 164. 7.51
#> 5 6 6 145 NA
#> 6 8 2 346. 14.4
#> 7 8 3 276. 0
#> 8 8 4 406. 21.7
#> 9 8 8 301 NA
An alternative way building on the same logic is to use purrr::map2
. However,
here we have to put some effort into creating vectors with the desired names.
library(purrr)
# setup vectors and names
myfuns <- c("mean", "sd")
myvars <- c("disp", "hp") %>%
set_names(., paste(., myfuns, sep = "_"))
mtcars %>%
group_by(across(c("cyl", "carb"))) %>%
summarise(map2(myvars,
myfuns,
~ do.call(.y, list(sym(.x)))
) %>% bind_cols()
)
#> `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
#> # A tibble: 9 x 4
#> # Groups: cyl [3]
#> cyl carb disp_mean hp_sd
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 1 91.4 16.1
#> 2 4 2 117. 24.9
#> 3 6 1 242. 3.54
#> 4 6 4 164. 7.51
#> 5 6 6 145 NA
#> 6 8 2 346. 14.4
#> 7 8 3 276. 0
#> 8 8 4 406. 21.7
#> 9 8 8 301 NA
Created on 2021-08-20 by the reprex package (v2.0.1)
Upvotes: 1
Reputation: 887851
With different functions on different columns, an option is to use collap
from collapse
library(collapse)
collap(mtcars, ~ cyl + carb, custom = list(fmean = 4, fsd = 5))
-output
cyl disp hp carb
1 4 91.38 16.133815 1
2 4 116.60 24.859606 2
3 6 241.50 3.535534 1
4 6 163.80 7.505553 4
5 6 145.00 NA 6
6 8 345.50 14.433757 2
7 8 275.80 0.000000 3
8 8 405.50 21.725561 4
9 8 301.00 NA 8
Or the index can be dynamically generated with match
collap(mtcars, ~ cyl + carb, custom = list(fmean =
match('disp', names(mtcars)), fsd = match('hp', names(mtcars))))
With tidyverse
, an option is to loop over the column names of interest and the functions in map2
and do a join later
library(dplyr)
library(purrr)
library(stringr)
map2(c("disp", "hp"), c("mean", "sd"), ~
mtcars %>%
group_by(across(c('cyl', 'carb'))) %>%
summarise(across(all_of(.x), match.fun(.y),
.names = str_c("{.col}_", .y)), .groups = 'drop')) %>%
reduce(inner_join)
-output
# A tibble: 9 x 4
cyl carb disp_mean hp_sd
<dbl> <dbl> <dbl> <dbl>
1 4 1 91.4 16.1
2 4 2 117. 24.9
3 6 1 242. 3.54
4 6 4 164. 7.51
5 6 6 145 NA
6 8 2 346. 14.4
7 8 3 276. 0
8 8 4 406. 21.7
9 8 8 301 NA
Upvotes: 6