Reputation: 3176
I'm trying to use the function group_modify (which I've learned about here).
The goal is to take a data.frame, split it with group_by, and then apply a home-made function that does some reorganisation (namely sorting, selecting the "best" row and, if there is more than one, averaging the values). I need the output data.frame to have all the columns of the original one.
Here is a reproducible example that should make everything clearer:
The data:
library(dplyr)
(dd <- data.frame(id = c("a", "a", "b", "b", "c", "c", "c"), cat = c("s2", "s1", "s1", "s1", "s3", "s2", "s2"), val = 1:7))
id cat val
1 a s2 1
2 a s1 2
3 b s1 3
4 b s1 4
5 c s3 5
6 c s2 6
7 c s2 7
My function (basic one that shows my problem, but not exactly the one I'm actually using):
simple_fun <- function(slice, key){
  big_out_to_show_error <<- slice
  temp1 <- arrange(slice, cat)
  temp2 <- temp1 %>%
    filter(cat==temp1$cat[1])
  if(nrow(temp2)>1) {
    temp2 <- temp2 %>%
      group_by(id, cat) %>%
      summarise(val = mean(val))
  }
  return(data.frame(temp2))
}
The output I want (one row per id with the "best" cat, the average of val if there is more than one such row, and all the variables from the original data.frame):
id cat val
a a s1 2.0
b b s1 3.5
c c s2 6.5
My attempt with dplyr::group_modify throws an error:
dd %>%
  group_by(id) %>%
  group_modify(simple_fun)
Error: Column `id` is unknown
This is because the slice passed to the function does not include the grouping variable. This can be seen with this simple code, which relies on the line big_out_to_show_error <<- slice in the main function and is limited to id=="a":
filter(dd, id=="a") %>%
  group_by(id) %>%
  group_modify(simple_fun)
# A tibble: 1 x 3
# Groups: id [1]
id cat val
<fct> <fct> <int>
1 a s1 2
big_out_to_show_error
# A tibble: 2 x 2
cat val
<fct> <int>
1 s2 1
2 s1 2
How can I get group_by to keep the grouping variable in the slice so that my function works with group_modify?
As a side note, I'm really trying to understand and fix the dplyr group_by behavior. I already know the base R way to do it:
split(dd, dd$id) %>%
  lapply(simple_fun) %>%
  do.call("rbind", .)
id cat val
a a s1 2.0
b b s1 3.5
c c s2 6.5
Thanks
Upvotes: 2
Views: 1377
Reputation: 3176
27 ϕ 9's answer is perfect and answers my question. Now, considering that there are several ways to analyse the dataset and that my dataset is quite big (1.3 million rows), I did a quick benchmark to compare the base R (split/lapply) and the Tidyverse (group_by/group_modify) approaches, using both possible functions (the one that uses arrange and the one that uses slice_min).
It may not be optimal or state-of-the-art programming, but it is a quick and dirty comparison that gives a fair idea of the most efficient way to do this analysis.
library(dplyr)
library(microbenchmark)
library(ggplot2)

# Small simulated dataset, sorted by id
nbrows <- 200
set.seed(1234)
bigdd <- data.frame(id = sample(nbrows/2, nbrows, replace = TRUE),
                    cat = sample(c("S1", "S2", "S3"), nbrows, replace = TRUE),
                    val = runif(nbrows)) %>%
  arrange(id)

# Base R split/lapply with the arrange-based function
f_baser_arrange <- function(dd){
  simple_fun0 <- function(slice, key){
    temp1 <- arrange(slice, cat)
    temp2 <- temp1 %>%
      filter(cat==temp1$cat[1])
    if(nrow(temp2)>1) {
      temp2 <- temp2 %>%
        group_by(id, cat) %>%
        summarise(val = mean(val), .groups = 'drop')
    }
    return(data.frame(temp2))
  }
  split(dd, dd$id) %>%
    lapply(simple_fun0) %>%
    do.call("rbind", .)
}

# Base R split/lapply with the slice_min-based function
f_baser_slice_min <- function(dd){
  simple_fun3 <- function(slice, key){
    slice %>%
      slice_min(cat, n = 1) %>%  # n is named explicitly (newer dplyr versions require it)
      summarise(id = unique(id),
                cat = unique(cat),
                val = mean(val))
  }
  split(dd, dd$id) %>%
    lapply(simple_fun3) %>%
    do.call("rbind", .)
}

# group_by/group_modify with the arrange-based function
f_tidy_arrange <- function(dd){
  simple_fun1 <- function(slice, key){
    temp1 <- arrange(slice, cat)
    temp2 <- temp1 %>%
      filter(cat==temp1$cat[1])
    if(nrow(temp2)>1) {
      temp2 <- temp2 %>%
        group_by(cat) %>%
        summarise(val = mean(val), .groups = 'drop')
    }
    return(data.frame(temp2))
  }
  dd %>%
    group_by(id) %>%
    group_modify(simple_fun1)
}

# group_by/group_modify with the slice_min-based function
f_tidy_slice_min <- function(dd){
  simple_fun2 <- function(slice, key){
    slice %>%
      slice_min(cat, n = 1) %>%
      summarise(cat = unique(cat),
                val = mean(val))
  }
  dd %>%
    group_by(id) %>%
    group_modify(simple_fun2)
}

res <- microbenchmark(f_baser_arrange(bigdd),
                      f_baser_slice_min(bigdd),
                      f_tidy_arrange(bigdd),
                      f_tidy_slice_min(bigdd),
                      times = 100)

# Boxplot of timings by approach and method
data.frame(res) %>%
  mutate(Philosophy = ifelse(grepl("baser", expr), "Base R", "Tidyverse"),
         Method = ifelse(grepl("arrange", expr), "arrange", "slice_min")) %>%
  ggplot(aes(x=Philosophy, y=time, color=Method))+
  geom_boxplot(position=position_dodge(0.5))
In this benchmark (200 rows, 100 repetitions), the base R split/lapply approach is generally faster than the Tidyverse group_by/group_modify way. We also notice that @27 ϕ 9's slice_min version is faster than my original arrange approach.
The base R approach can also be sped up even further by replacing lapply with parLapply.
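For what it's worth, here is a minimal sketch of what the parLapply variant could look like. It was not part of the benchmark above; best_row is a hypothetical name for the slice_min-based function, the cluster size of 2 is an arbitrary assumption, and whether it actually pays off depends on the data size and on the cost of starting the workers.

```r
library(dplyr)
library(parallel)

# Hypothetical per-group function (slice_min variant used with split/lapply)
best_row <- function(slice) {
  slice %>%
    slice_min(cat, n = 1) %>%
    summarise(id = unique(id),
              cat = unique(cat),
              val = mean(val))
}

# Same kind of simulated data as in the benchmark
set.seed(1234)
bigdd <- data.frame(id = sample(100, 200, replace = TRUE),
                    cat = sample(c("S1", "S2", "S3"), 200, replace = TRUE),
                    val = runif(200))

# Assumption: 2 workers; each worker needs dplyr attached
cl <- makeCluster(2)
invisible(clusterEvalQ(cl, library(dplyr)))

res_par <- parLapply(cl, split(bigdd, bigdd$id), best_row) %>%
  do.call("rbind", .)

stopCluster(cl)

# Sequential version, for comparison
res_seq <- lapply(split(bigdd, bigdd$id), best_row) %>%
  do.call("rbind", .)
```

Both versions should give the same table; only the way the per-group work is dispatched changes.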
Upvotes: 0
Reputation: 34751
group_modify() creates two objects for each group: a tibble containing the subset of the data, and a separate single-row tibble containing the group information.
Because the group information is restored automatically when group_modify() returns the data, it's generally not necessary to keep it in the subset data, so it is removed by default. You can use the .keep argument to retain it, but this will cause an error if the grouping variables are still present in the data returned by your function.
So you can fix your function by using the .keep argument and then removing the grouping variable before the data is returned:
simple_fun <- function(slice, key){
  temp1 <- arrange(slice, cat)
  temp2 <- temp1 %>%
    filter(cat==temp1$cat[1])
  if(nrow(temp2)>1) {
    temp2 <- temp2 %>%
      group_by(id, cat) %>%
      summarise(val = mean(val), .groups = "drop")
  }
  temp2 %>%
    select(-id)
}
dd %>%
  group_by(id) %>%
  group_modify(simple_fun, .keep = TRUE)
# A tibble: 3 x 3
# Groups: id [3]
id cat val
<chr> <chr> <dbl>
1 a s1 2
2 b s1 3.5
3 c s2 6.5
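To make the "two objects" point concrete, here is a small sketch that only reports which columns the slice and the key carry, with and without .keep. describe_group is a hypothetical helper written for this illustration; it is not part of dplyr.

```r
library(dplyr)

dd <- data.frame(id = c("a", "a", "b", "b", "c", "c", "c"),
                 cat = c("s2", "s1", "s1", "s1", "s3", "s2", "s2"),
                 val = 1:7)

# Hypothetical helper: list the column names of the slice and of the key
describe_group <- function(slice, key) {
  data.frame(slice_cols = paste(names(slice), collapse = ", "),
             key_cols = paste(names(key), collapse = ", "))
}

# Default: the grouping variable is dropped from the slice
default_view <- dd %>%
  group_by(id) %>%
  group_modify(describe_group) %>%
  ungroup()

# .keep = TRUE: the grouping variable stays in the slice
keep_view <- dd %>%
  group_by(id) %>%
  group_modify(describe_group, .keep = TRUE) %>%
  ungroup()
```

In default_view the slice columns are cat and val only, while in keep_view they are id, cat and val; in both cases the key holds id.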
You can also simplify the function to sidestep this issue altogether:
simple_fun2 <- function(slice, key){
  slice %>%
    slice_min(cat, n = 1) %>%  # n is named explicitly (newer dplyr versions require it)
    summarise(cat = unique(cat),
              val = mean(val))
}
dd %>%
  group_by(id) %>%
  group_modify(simple_fun2)
# A tibble: 3 x 3
# Groups: id [3]
id cat val
<chr> <chr> <dbl>
1 a s1 2
2 b s1 3.5
3 c s2 6.5
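As a side note, if the per-group logic really is this simple, the same table can probably be obtained without group_modify() at all, since slice_min() and summarise() already work group by group. This is only a sketch of that alternative, not a replacement for a more complex custom function:

```r
library(dplyr)

dd <- data.frame(id = c("a", "a", "b", "b", "c", "c", "c"),
                 cat = c("s2", "s1", "s1", "s1", "s3", "s2", "s2"),
                 val = 1:7)

res <- dd %>%
  group_by(id) %>%
  slice_min(cat, n = 1) %>%        # keep the row(s) with the "best" cat per id (ties kept)
  summarise(cat = first(cat),      # all remaining rows of a group share the same cat
            val = mean(val))       # average val over the tied rows
```

Here the grouping variable never leaves the data, so the question of whether the slice contains id does not come up.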
Upvotes: 1