Keith Hughitt
Keith Hughitt

Reputation: 4960

Dplyr group_by summarize keep min/max value for each column within group, depending on column suffix

I have a dataframe containing groups that I want to collapse into single entries in such a way that the values of the new rows are either the minimum or maximum values within each group, depending on the column name.

For example, given:

set.seed(1)
dat <- data.frame(grp = c('A', 'A', 'B', 'B'), 
                  v1_high = rnorm(4), 
                  v2_high = rnorm(4), 
                  v3_low = rnorm(4))
# original
   grp    v1_high    v2_high     v3_low
 1   A -0.6264538  0.3295078  0.5757814
 2   A  0.1836433 -0.8204684 -0.3053884
 3   B -0.8356286  0.4874291  1.5117812
 4   B  1.5952808  0.7383247  0.3898432

I would like to generate a new dataframe with two entries corresponding to the groups "A" and "B" with the maximum values of each column ending in _high and the minimum values of each column ending in _low.

In this case:

# desired result
  grp   v1_high   v2_high     v3_low
1   A 0.1836433 0.3295078 -0.3053884
2   B 1.5952808 0.7383247  0.3898432

Finally, the number and names of such columns are not known beforehand.

Ideally, the summarization would also handle the case where there are no _min columns present, and only _max ones, although I can always check for this scenario manually and handle the two cases in separately using an if () { .. } statement.

Any suggestions? It seems like this should be possible using the summarize_at() function, but I haven't figured out how to apply different functions to different sets of columns.

Upvotes: 2

Views: 2574

Answers (1)

akrun
akrun

Reputation: 887088

An option would be to group by 'grp', get the max of columns that end with 'high' (column names), use that also as grouping column and get the min of columns that end with 'low'

library(dplyr)    
dat %>%
   group_by(grp) %>%  
   mutate_at(vars(ends_with('high')), max) %>% 
   group_by_at(vars(ends_with('high')), .add = TRUE) %>% 
   summarise_at(vars(ends_with('low')), min)
# A tibble: 2 x 4
# Groups:   grp, v1_high [2]
#  grp   v1_high v2_high v3_low
#  <fct>   <dbl>   <dbl>  <dbl>
#1 A       0.184   0.330 -0.305
#2 B       1.60    0.738  0.390

It would also work if there are no min columns

dat[-4] %>%
    group_by(grp) %>%  
    mutate_at(vars(ends_with('high')), max) %>% 
    group_by_at(vars(ends_with('high')), .add = TRUE) %>%   
    summarise_at(vars(ends_with('low')), min)
# A tibble: 2 x 3
# Groups:   grp, v1_high [2]
#  grp   v1_high v2_high
#  <fct>   <dbl>   <dbl>
#1 A       0.184   0.330
#2 B       1.60    0.738

Or another option is map2

library(purrr)
map2(list(min, max), list('low', 'high'), ~ 
      dat %>% 
         select(grp, ends_with(.y)) %>%
         group_by(grp) %>%
         summarise_all(.x)) %>% 
         reduce(inner_join, by = 'grp')

Upvotes: 1

Related Questions