Reputation: 3462
I have the following data frame:
df = data.frame(a = c(1,1,3,2,2), b = c(6,7,3,9,10))
##   a  b
## 1 1  6
## 2 1  7
## 3 3  3
## 4 2  9
## 5 2 10
I want to analyze the data by groups (a is the grouping variable), but instead of the usual setup (each value of a defines a group of rows, and the groups are disjoint) I need "cumulative groups": for a value a = i, the group should contain all rows with a <= i. These groups are not disjoint, but I still want to summarize each one separately.
So, for example, if for each group I want the mean of b, the result would be:
## a mean_b
## 1 6.5
## 2 8
## 3 7
Note that in the real scenario behind this simplified example, I cannot analyze the disjoint groups separately and then aggregate the relevant ones; the summarizing function must be "aware" of all the rows in a group to perform the computation.
Of course I can use some apply functions, compute things the good old way, and build a new data frame from the result, but I am looking for dplyr/tidyverse-style functions to do this.
Any suggestions?
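For concreteness, the "good old way" I have in mind is roughly this base-R sketch (just for the mean example above):

vals <- sort(unique(df$a))
data.frame(a = vals,
           mean_b = sapply(vals, function(i) mean(df$b[df$a <= i])))
##   a mean_b
## 1 1    6.5
## 2 2    8.0
## 3 3    7.0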
Upvotes: 2
Views: 299
Reputation: 47350
I would do it this way: build one filtered copy of the data per cumulative group with purrr::map_dfr(), then summarise by that group id:
library(dplyr)
library(purrr)

df %>%
  arrange(a) %>%
  map_dfr(seq_along(as <- unique(.$a)),
          ~ filter(.y, a %in% as[1:.x]),
          .y = ., .id = "meta_group") %>%
  group_by(a = meta_group) %>%
  summarise(b = mean(b))
# # A tibble: 3 x 2
#   a         b
#   <chr> <dbl>
# 1 1       6.5
# 2 2       8
# 3 3       7
If you want a separate function, you can do:
summarize2 <- function(.data, ..., .by){
  # distinct values of the grouping column, in increasing order,
  # so that group i covers all rows with values <= grps[i]
  grps <- select_at(.data, .by) %>% pull() %>% unique() %>% sort()
  .data %>%
    arrange_at(.by) %>%
    map_dfr(seq_along(grps),
            ~ filter_at(.y, .by, all_vars(. %in% grps[1:.x])),
            .y = .,
            .id = "meta_group") %>%
    group_by(meta_group) %>%
    summarise(...)
}
df %>%
  summarize2(b = mean(b), .by = "a")
# # A tibble: 3 x 2
#   meta_group     b
#   <chr>      <dbl>
# 1 1            6.5
# 2 2            8
# 3 3            7
df %>%
  summarize2(b = mean(b), .by = vars(a))
# # A tibble: 3 x 2
#   meta_group     b
#   <chr>      <dbl>
# 1 1            6.5
# 2 2            8
# 3 3            7
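Side note: the *_at verbs used above are superseded in current dplyr. A rough equivalent of the same idea using the .data pronoun (a sketch, assuming a single grouping column passed as a string, and a hypothetical name summarize_cuml) could look like:

summarize_cuml <- function(data, ..., .by) {
  data <- arrange(data, .data[[.by]])
  grps <- sort(unique(data[[.by]]))
  map_dfr(seq_along(grps),
          # one cumulative slice of the data per distinct value of .by
          ~ filter(data, .data[[.by]] %in% grps[1:.x]),
          .id = "meta_group") %>%
    group_by(meta_group) %>%
    summarise(...)
}

df %>%
  summarize_cuml(b = mean(b), .by = "a")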
Upvotes: 1
Reputation: 12478
I had a look and I don't see how it is possible with dplyr itself. However, we can hack the group_by function to make it cumulative. I'll quickly walk you through it.
First, I recreate your df:
df = data.frame(a = c(1,1,3,2,2), b = c(6,7,3,9,10))
Now I use the normal group_by to check out what it actually does to the data.frame.
library(dplyr)
df_grouped <- df %>%
arrange(a) %>%
group_by(a)
> attributes(df_grouped)
$class
[1] "grouped_df" "tbl_df" "tbl" "data.frame"
$row.names
[1] 1 2 3 4 5
$names
[1] "a" "b"
$vars
[1] "a"
$drop
[1] TRUE
$indices
$indices[[1]]
[1] 0 1
$indices[[2]]
[1] 2 3
$indices[[3]]
[1] 4
$group_sizes
[1] 2 2 1
$biggest_group_size
[1] 2
$labels
a
1 1
2 2
3 3
So, besides other things, there is a new attribute called indices, which stores the (zero-based) row indices belonging to each group. We can simply change it to make the groups cumulative.
for (i in seq_along(attributes(df_grouped)[["indices"]])[-1]) {
  attributes(df_grouped)[["indices"]][[i]] <- c(
    attributes(df_grouped)[["indices"]][[i - 1]],
    attributes(df_grouped)[["indices"]][[i]]
  )
}
It looks a bit weird but is straightforward. The elements of each group are added to the next group. E.g. all elements from group 1 are added to group 2.
> attributes(df_grouped)$indices
[[1]]
[1] 0 1
[[2]]
[1] 0 1 2 3
[[3]]
[1] 0 1 2 3 4
We can use the changed groups in the normal dplyr way.
> df_grouped %>%
+   summarise(mean_b = mean(b))
# A tibble: 3 x 2
      a mean_b
  <dbl>  <dbl>
1     1    6.5
2     2    8
3     3    7
Now, of course, this is pretty ugly and looks very hacky, but inside a function that doesn't really matter, as long as it is still efficient (which it is). So let's make a custom group_by.
group_by_cuml <- function(.data, ...) {
  .data_grouped <- group_by(.data, ...)
  # append each group's row indices to all later groups, making the groups cumulative
  for (i in seq_along(attributes(.data_grouped)[["indices"]])[-1]) {
    attributes(.data_grouped)[["indices"]][[i]] <- c(
      attributes(.data_grouped)[["indices"]][[i - 1]],
      attributes(.data_grouped)[["indices"]][[i]]
    )
  }
  return(.data_grouped)
}
Now you can use the custom function in a clean dplyr pipe.
> df %>%
+   group_by_cuml(a) %>%
+   summarise(mean_b = mean(b))
# A tibble: 3 x 2
      a mean_b
  <dbl>  <dbl>
1     1    6.5
2     2    8
3     3    7
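For what it's worth, this relies on the internal representation of grouped data frames in older dplyr (before 0.8), where the grouping was stored in an indices attribute. In current dplyr the grouping lives in a "groups" attribute instead (a tibble with the grouping keys plus a .rows list-column of row indices), so a rough sketch of the same hack there, still poking at internals and therefore illustrative only, could look like:

group_by_cuml <- function(.data, ...) {
  grouped <- group_by(.data, ...)
  groups  <- attr(grouped, "groups")   # tibble: grouping keys + .rows list-column
  # append each group's row indices to all later groups, as above
  for (i in seq_along(groups$.rows)[-1]) {
    groups$.rows[[i]] <- c(groups$.rows[[i - 1]], groups$.rows[[i]])
  }
  attr(grouped, "groups") <- groups
  grouped
}

df %>%
  group_by_cuml(a) %>%
  summarise(mean_b = mean(b))   # should again give 6.5, 8 and 7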
Upvotes: 1
Reputation: 51612
One way is to use the base function Reduce with the argument accumulate = TRUE. Once you have the cumulative groups, you can apply any function to them, i.e.
Reduce(c, split(df$b,df$a), accumulate = TRUE)
#[[1]]
#[1] 6 7
#[[2]]
#[1] 6 7 9 10
#[[3]]
#[1] 6 7 9 10 3
and then for the mean,
sapply(Reduce(c, split(df$b,df$a), accumulate = TRUE), mean)
[1] 6.5 8.0 7.0
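If you also want the result in the shape asked for in the question (one row per value of a), one way, for example, is:

groups <- Reduce(c, split(df$b, df$a), accumulate = TRUE)
data.frame(a = sort(unique(df$a)),
           mean_b = sapply(groups, mean))
#   a mean_b
# 1 1    6.5
# 2 2    8.0
# 3 3    7.0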
Upvotes: 0
Reputation: 389325
How about something like this?
library(dplyr)
df %>%
  arrange(a) %>%
  group_by(a) %>%
  summarise(sum_b = sum(b)) %>%
  ungroup() %>%
  mutate(sum_b = cumsum(sum_b))
#      a sum_b
#  <dbl> <dbl>
#1     1    13
#2     2    32
#3     3    35
We take the sum of b by group (a) and then take the cumulative sum, so each group's total also includes the totals of all the previous groups.
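The same idea extends to the mean asked for in the question if you carry cumulative sums and counts (this only works when the summary can be decomposed like this, which, as noted in the question, may not hold for the real use case):

df %>%
  arrange(a) %>%
  group_by(a) %>%
  summarise(s = sum(b), n = n()) %>%
  ungroup() %>%
  mutate(mean_b = cumsum(s) / cumsum(n)) %>%
  select(a, mean_b)
#      a mean_b
#  <dbl>  <dbl>
#1     1    6.5
#2     2    8
#3     3    7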
Upvotes: 3