Managing dplyr group_by function to keep the grouping variable when used in combination with group_modify

Question

I'm trying to use the function group_modify (which I've learned about here).

The goal is to take a data.frame, split it with group_by, and then apply a home-made function that does some reorganisation (namely sorting, selecting the "best" row and, if there is more than one, averaging the values). I need the output data.frame to have all the columns of the original one.

Here is a reproducible example that will make everything clearer:

The data:

library(dplyr)
(dd <- data.frame(id = c("a", "a", "b", "b", "c", "c", "c"), cat = c("s2", "s1", "s1", "s1", "s3", "s2", "s2"), val = 1:7))
  id cat val
1  a  s2   1
2  a  s1   2
3  b  s1   3
4  b  s1   4
5  c  s3   5
6  c  s2   6
7  c  s2   7

My function (a basic one that shows my problem, not exactly the one I'm actually using):

simple_fun <- function(slice, key){
  big_out_to_show_error <<- slice

  temp1 <- arrange(slice, cat)
  
  temp2 <- temp1 %>% 
    filter(cat==temp1$cat[1])

  if(nrow(temp2)>1) {
    temp2 <- temp2 %>% 
      group_by(id, cat) %>% 
      summarise(val = mean(val))
  }
  
  return(data.frame(temp2))
  
}

The output I want (one row per id with the "best" cat; if more than one row has it, the average of val; and all the variables from the original data.frame):

  id cat val
a  a  s1 2.0
b  b  s1 3.5
c  c  s2 6.5

My attempt with dplyr::group_modify throws an error:

dd %>% 
   group_by(id) %>%
   group_modify(simple_fun)
 Error: Column `id` is unknown

This is because the slice passed to the function does not include the grouping variable. This can be seen by restricting the data to id=="a" and then inspecting the copy saved by the line big_out_to_show_error <<- slice in the main function:

filter(dd, id=="a") %>% 
   group_by(id) %>%
   group_modify(simple_fun)
# A tibble: 1 x 3
# Groups:   id [1]
  id    cat     val
    
1 a     s1        2

big_out_to_show_error
# A tibble: 2 x 2
  cat     val
   
1 s2        1
2 s1        2

How can I make group_by keep the grouping variable in the slice, so that my function works with group_modify?

As a side note, I'm really trying to understand and fix the dplyr group_by behavior. I already know the base R way to do it:

split(dd, dd$id) %>% 
  lapply(simple_fun) %>% 
  do.call("rbind", .)
  id cat val
a  a  s1 2.0
b  b  s1 3.5
c  c  s2 6.5

Thanks

Iroha · Accepted Answer

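group_modify() creates two objects for each group: a tibble containing the subset of the data, and a separate single-row tibble containing the group information. They are passed to your function as its first and second arguments (slice and key in your code).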
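
To see these two objects for yourself, here is a minimal sketch that reports which columns each one holds. The helper name describe_group is made up for illustration, and dd is rebuilt from the question so the block runs on its own.

```r
library(dplyr)

# Same data as in the question
dd <- data.frame(
  id  = c("a", "a", "b", "b", "c", "c", "c"),
  cat = c("s2", "s1", "s1", "s1", "s3", "s2", "s2"),
  val = 1:7
)

# Hypothetical helper: list the columns of the slice and of the key
describe_group <- function(slice, key) {
  tibble(
    slice_cols = paste(names(slice), collapse = ", "),
    key_cols   = paste(names(key), collapse = ", "),
    n_rows     = nrow(slice)
  )
}

# With the default .keep = FALSE, id appears in the key but not in the slice
default_view <- dd %>%
  group_by(id) %>%
  group_modify(describe_group)

print(default_view)
```

The key is where the value of id for the current group lives, so a function that only needs to know which group it is working on can read key$id instead of looking for id in the slice.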

Because the group information will be restored automatically when group_modify() returns the data, it's generally not necessary for this information to be kept in the subset data so, by default, it is removed. However, you can use the .keep argument to retain it. Be aware that this will cause an error if the group variables are still present in the data your function returns.
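
Both effects of .keep can be checked with a short sketch. It assumes the dd data from the question, and the error is caught with tryCatch() so the block runs to the end; the exact wording of the message may differ between dplyr versions.

```r
library(dplyr)

# Same data as in the question
dd <- data.frame(
  id  = c("a", "a", "b", "b", "c", "c", "c"),
  cat = c("s2", "s1", "s1", "s1", "s3", "s2", "s2"),
  val = 1:7
)

# With .keep = TRUE the slice handed to the function includes id
slice_names <- dd %>%
  group_by(id) %>%
  group_modify(~ tibble(cols = paste(names(.x), collapse = ", ")), .keep = TRUE)

# Returning the slice unchanged leaves id in the result,
# which group_modify() rejects with an error
err_msg <- tryCatch(
  {
    dd %>% group_by(id) %>% group_modify(~ .x, .keep = TRUE)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)

print(slice_names)
print(err_msg)
```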

So you can fix your function by using the .keep argument and then removing the grouping variables before the data is returned:

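simple_fun <- function(slice, key){

  # Sort so that the "best" cat comes first
  temp1 <- arrange(slice, cat)

  # Keep only the rows with the best cat
  temp2 <- temp1 %>% 
    filter(cat == temp1$cat[1])

  # If several rows remain, average val
  if(nrow(temp2) > 1) {
    temp2 <- temp2 %>% 
      group_by(id, cat) %>% 
      summarise(val = mean(val), .groups = "drop")
  }

  # Drop the grouping variable; group_modify() adds it back
  temp2 %>%
    select(-id)
}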

dd %>% 
  group_by(id) %>%
  group_modify(simple_fun, .keep = TRUE)

# A tibble: 3 x 3
# Groups:   id [3]
  id    cat     val
    
1 a     s1      2  
2 b     s1      3.5
3 c     s2      6.5

You can also simplify the function to sidestep this issue altogether:

simple_fun2 <- function(slice, key){

  slice %>% 
    # n must be named; ties are kept by default
    slice_min(cat, n = 1) %>%
    summarise(cat = unique(cat),
              val = mean(val))
}

dd %>% 
  group_by(id) %>%
  group_modify(simple_fun2)

# A tibble: 3 x 3
# Groups:   id [3]
  id    cat     val
    
1 a     s1      2  
2 b     s1      3.5
3 c     s2      6.5
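
If the per-group logic is as simple as in this example, the same result can also be reached without group_modify() at all. The following is only a sketch of that alternative, not part of the approach above. It assumes that "best" means the smallest cat in sort order, as in the question, and it relies on slice_min() keeping ties by default.

```r
library(dplyr)

# Same data as in the question
dd <- data.frame(
  id  = c("a", "a", "b", "b", "c", "c", "c"),
  cat = c("s2", "s1", "s1", "s1", "s3", "s2", "s2"),
  val = 1:7
)

result <- dd %>%
  group_by(id) %>%
  # keep every row tied for the smallest cat within each id
  slice_min(cat, n = 1) %>%
  # one row per id: the best cat and the mean of its val
  summarise(cat = first(cat), val = mean(val), .groups = "drop")

print(result)
```

Unlike the group_modify() versions, this returns an ungrouped tibble because of .groups = "drop".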
