Bastien

Reputation: 3176

Managing dplyr group_by function to keep the grouping variable when used in combination with group_modify

I'm trying to use the function group_modify (which I've learned about here).

The goal is to take a data.frame, split it with group_by, and then apply a homemade function that does some reorganisation (namely sorting, selecting the "best" row and, if there is more than one, averaging the values). I need the output data.frame to have all the columns of the original one.

Here is a reproducible example that will make everything clearer:

The data:

library(dplyr)
(dd <- data.frame(id = c("a", "a", "b", "b", "c", "c", "c"), cat = c("s2", "s1", "s1", "s1", "s3", "s2", "s2"), val = 1:7))
  id cat val
1  a  s2   1
2  a  s1   2
3  b  s1   3
4  b  s1   4
5  c  s3   5
6  c  s2   6
7  c  s2   7

My function (basic one that shows my problem, but not exactly the one I'm actually using):

simple_fun <- function(slice, key){
  big_out_to_show_error <<- slice

  temp1 <- arrange(slice, cat)
  
  temp2 <- temp1 %>% 
    filter(cat==temp1$cat[1])

  if(nrow(temp2)>1) {
    temp2 <- temp2 %>% 
      group_by(id, cat) %>% 
      summarise(val = mean(val))
  }
  
  return(data.frame(temp2))
  
}

The output I want (one row per ID having the "best" cat and if more than one row, average of val and having all the variables from the original data.frame):

  id cat val
a  a  s1 2.0
b  b  s1 3.5
c  c  s2 6.5

My attempt with dplyr::group_modify throws an error:

dd %>% 
   group_by(id) %>%
   group_modify(simple_fun)
 Error: Column `id` is unknown 

This is because the slice passed to my function does not include the grouping variable. This can be seen with the following code, which relies on the line big_out_to_show_error <<- slice in the function above and limits the data to id=="a":

filter(dd, id=="a") %>% 
   group_by(id) %>%
   group_modify(simple_fun)
# A tibble: 1 x 3
# Groups:   id [1]
  id    cat     val
  <fct> <fct> <int>
1 a     s1        2

big_out_to_show_error
# A tibble: 2 x 2
  cat     val
  <fct> <int>
1 s2        1
2 s1        2

How can I get group_by to keep the grouping variable in the slice so that my function works with group_modify?

As a side note, I'm really trying to understand and fix the dplyr group_by behavior. I already know the base R way to do it:

split(dd, dd$id) %>% 
  lapply(simple_fun) %>% 
  do.call("rbind", .)
  id cat val
a  a  s1 2.0
b  b  s1 3.5
c  c  s2 6.5

Thanks

Upvotes: 2

Views: 1377

Answers (2)

Bastien

Reputation: 3176

27 ϕ 9's answer is perfect and answers my question. Now, considering that there are multiple ways to analyse the dataset and that my dataset is quite big (1.3 million rows), I ran a quick benchmark to compare the base R (split/lapply) and Tidyverse (group_by/group_modify) approaches, using both possible functions (the one that uses arrange and the one that uses slice_min).

It may not be optimal or state-of-the-art programming, but this quick and dirty comparison gives a fair idea of the most efficient way to do this analysis.

library(dplyr)
library(microbenchmark)
library(ggplot2)

nbrows <- 200
set.seed(1234)
bigdd <- data.frame(id = sample(nbrows/2, nbrows, replace = TRUE), 
                    cat = sample(c("S1", "S2", "S3"), nbrows, replace = TRUE),
                    val = runif(nbrows)) %>% 
  arrange(id)

f_baser_arrange <- function(dd){
  
  simple_fun0 <- function(slice, key){
    temp1 <- arrange(slice, cat)
    temp2 <- temp1 %>% 
      filter(cat==temp1$cat[1])
    if(nrow(temp2)>1) {
      temp2 <- temp2 %>% 
        group_by(id, cat) %>% 
        summarise(val = mean(val), .groups = 'drop')
    }
    return(data.frame(temp2))
  }
  
  split(dd, dd$id) %>% 
    lapply(simple_fun0) %>% 
    do.call("rbind", .)
}

f_baser_slice_min <- function(dd){
  simple_fun3 <- function(slice, key){
    slice %>% 
      slice_min(cat, n = 1) %>%
      summarise(id = unique(id),
                cat = unique(cat),
                val = mean(val))
  }
  
  split(dd, dd$id) %>% 
    lapply(simple_fun3) %>% 
    do.call("rbind", .)
}

f_tidy_arrange <- function(dd){
  simple_fun1 <- function(slice, key){
    temp1 <- arrange(slice, cat)
    temp2 <- temp1 %>% 
      filter(cat==temp1$cat[1])
    if(nrow(temp2)>1) {
      temp2 <- temp2 %>% 
        group_by(cat) %>% 
        summarise(val = mean(val), .groups = 'drop')
    }
    return(data.frame(temp2))
  }
  
  dd %>% 
    group_by(id) %>%
    group_modify(simple_fun1)
}

f_tidy_slice_min <- function(dd){
  simple_fun2 <- function(slice, key){
    slice %>% 
      slice_min(cat, n = 1) %>%
      summarise(cat = unique(cat),
                val = mean(val))
  }
  
  dd %>% 
    group_by(id) %>%
    group_modify(simple_fun2)
}

res <- microbenchmark(f_baser_arrange(bigdd),
               f_baser_slice_min(bigdd),
               f_tidy_arrange(bigdd),
               f_tidy_slice_min(bigdd),
               times = 100)

data.frame(res) %>% 
  mutate(Philosophy = ifelse(grepl("baser", expr), "Base R", "Tidyverse"),
         Method = ifelse(grepl("arrange", expr), "arrange", "slice_min")) %>% 
  ggplot(aes(x=Philosophy, y=time, color=Method))+
  geom_boxplot(position=position_dodge(0.5))

Which produces: [Figure: boxplots of benchmark time by Philosophy (Base R vs. Tidyverse), colored by Method (arrange vs. slice_min)]

We notice that the base R split/lapply approach is generally faster than the Tidyverse group_by/group_modify way. We also notice that @27 ϕ 9's slice_min approach is faster than my original arrange approach.

Also, the base R approach can be sped up even more by replacing lapply with parLapply.
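
To illustrate the parLapply idea, here is a minimal sketch using the small dd data from the question. The worker function best_row is a hypothetical base R stand-in for the dplyr functions above (so the workers do not need dplyr loaded), and the choice of 2 workers is arbitrary. It has not been benchmarked, and for small data the cluster start-up cost will likely outweigh any gain.

```r
library(parallel)

dd <- data.frame(id = c("a", "a", "b", "b", "c", "c", "c"),
                 cat = c("s2", "s1", "s1", "s1", "s3", "s2", "s2"),
                 val = 1:7,
                 stringsAsFactors = FALSE)

# Hypothetical base R worker: keep the rows with the smallest `cat`
# and average `val` over them.
best_row <- function(slice) {
  best <- min(as.character(slice$cat))
  keep <- slice[slice$cat == best, , drop = FALSE]
  data.frame(id = keep$id[1], cat = best, val = mean(keep$val))
}

cl <- makeCluster(2)  # assumption: 2 local worker processes
res <- do.call("rbind", parLapply(cl, split(dd, dd$id), best_row))
stopCluster(cl)

res
```

If the worker function used dplyr instead, each worker would also need the package loaded, for example with clusterEvalQ(cl, library(dplyr)).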

Upvotes: 0

Iroha

Reputation: 34751

group_modify() creates two objects for each group - a tibble containing the subset data, and a separate single row tibble containing the group information.
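
One way to inspect these two objects is with group_map(), which passes the same two arguments to the function but returns a plain list. This is only an illustrative sketch on the dd data from the question; the name pieces is made up for the example.

```r
library(dplyr)

dd <- data.frame(id = c("a", "a", "b", "b", "c", "c", "c"),
                 cat = c("s2", "s1", "s1", "s1", "s3", "s2", "s2"),
                 val = 1:7,
                 stringsAsFactors = FALSE)

# Record which columns the function receives in each argument.
pieces <- dd %>%
  group_by(id) %>%
  group_map(function(slice, key) {
    list(slice_cols = names(slice),  # columns of the subset data
         key_cols   = names(key),    # columns of the one-row key tibble
         key_id     = key$id)        # value of the grouping variable
  })

pieces[[1]]$slice_cols  # "cat" "val" (no id column)
pieces[[1]]$key_cols    # "id"
pieces[[1]]$key_id      # "a"
```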

Because the group information is restored automatically when group_modify() returns the data, the subset data generally doesn't need to carry it, so it is removed by default. You can retain it with .keep = TRUE, but group_modify() will then throw an error if the grouping variables are still present in the data your function returns.
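
As a small sketch of that behaviour on the dd data from the question (the names returns_id and drops_id are made up for the example, and the exact error message depends on the dplyr version, so the error is only caught here rather than shown):

```r
library(dplyr)

dd <- data.frame(id = c("a", "a", "b", "b", "c", "c", "c"),
                 cat = c("s2", "s1", "s1", "s1", "s3", "s2", "s2"),
                 val = 1:7,
                 stringsAsFactors = FALSE)

# Returning the slice unchanged with .keep = TRUE leaves `id` in the
# result, which group_modify() should reject.
returns_id <- tryCatch(
  {
    dd %>%
      group_by(id) %>%
      group_modify(function(slice, key) slice, .keep = TRUE)
    "no error"
  },
  error = function(e) "error"
)

# Dropping the grouping column before returning avoids the problem.
drops_id <- dd %>%
  group_by(id) %>%
  group_modify(function(slice, key) select(slice, -id), .keep = TRUE)

returns_id
names(drops_id)
```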

So you can fix your function by using the .keep argument and then removing the grouping variables before the data is returned:

simple_fun <- function(slice, key){

  temp1 <- arrange(slice, cat)
  
  temp2 <- temp1 %>% 
    filter(cat==temp1$cat[1])
  
  if(nrow(temp2)>1) {
    temp2 <- temp2 %>% 
      group_by(id, cat) %>% 
      summarise(val = mean(val), .groups = "drop")
  }   
  temp2 %>%
    select(-id)      
}

dd %>% 
  group_by(id) %>%
  group_modify(simple_fun, .keep = TRUE)

# A tibble: 3 x 3
# Groups:   id [3]
  id    cat     val
  <chr> <chr> <dbl>
1 a     s1      2  
2 b     s1      3.5
3 c     s2      6.5

You can also simplify the function to sidestep this issue altogether:

simple_fun2 <- function(slice, key){

  # keep the rows with the smallest cat (ties included), then average val
  slice %>% 
    slice_min(cat, n = 1) %>%
    summarise(cat = unique(cat),
              val = mean(val))
}

dd %>% 
  group_by(id) %>%
  group_modify(simple_fun2)

# A tibble: 3 x 3
# Groups:   id [3]
  id    cat     val
  <chr> <chr> <dbl>
1 a     s1      2  
2 b     s1      3.5
3 c     s2      6.5

Upvotes: 1
