amit
amit

Reputation: 3462

cumulative grouping

I have the following data frame:

df = data.frame(a = c(1,1,3,2,2), b=6:10)
##  a    b
##  1    6
##  1    7
##  3    3
##  2    9
##  2    10

I want to analyze the data by groups (a is the grouping parameter), but instead of the usual (e.g. each value specify a group of rows, and the groups are disjoint) I need "cumulative groups". that is, for the value of a=i, the group should contain all the rows in which a<=i. These are not disjoint groups, but still I want to summarize each group separately.

So for example, if for each group I want the mean of b, the result would be:

##    a    mean_b
##    1    6.5
##    2    8
##    3    7

note that in the real scenario behind this simplified example, I cannot analyze disjoint group separately and then aggregate the relevant groups. the summarize function must be "aware" of all the rows in that group to perform the computation.

So of course, I can use some apply functions and compute things in the good old way, and make a new df out of it, but I look for the dplyr/tidyverse like functions to do that.

any suggestions?

Upvotes: 2

Views: 299

Answers (4)

moodymudskipper
moodymudskipper

Reputation: 47350

I would do it this way :

df %>% 
  arrange(a) %>%
  map_dfr(seq_along(as <- unique(.$a)),
          ~filter(.y, a %in% as[1:.]),.y = ., .id = "a") %>%
  group_by(a = meta_group) %>%
  summarise(b = mean(b))

# # A tibble: 3 x 2
# a     b
# <chr> <dbl>
# 1     1   6.5
# 2     2   7.0
# 3     3   8.0

If you want a separate function you can do :

summarize2 <- function(.data, ..., .by){
  grps <- select_at(.data,.by) %>% pull %>% unique
  .data %>%
    arrange_at(.by) %>%
    map_dfr(seq_along(grps),
             ~ filter_at(.y, .by,all_vars(. %in% grps[1:.x])),
             .y = .,
             .id = "meta_group") %>%
    group_by(meta_group) %>%
    summarise(...)
}

df %>% 
  summarize2(b = mean(b), .by = "a")
# # A tibble: 3 x 2
#   meta_group     b
#        <chr> <dbl>
# 1          1   6.5
# 2          2   7.0
# 3          3   8.0

df %>% 
  summarize2(b = mean(b), .by = vars(a))
# # A tibble: 3 x 2
#   meta_group     b
#        <chr> <dbl>
# 1          1   6.5
# 2          2   7.0
# 3          3   8.0

Upvotes: 1

JBGruber
JBGruber

Reputation: 12478

I had a look and I don't see how it is possible with dplyr itself. However, we can hack the group_by function to make it cumulative. I'll quickly walkd you through it:

First, I make your df. It doesn't really fit your output above, so I slightly changed it.

df = data.frame(a = c(1,1,3,2,2), b=6:10)
df$b[3] <- 3

Now I use the normal group_by to check out what it actually does to the data.frame.

library(dplyr)
df_grouped <- df %>%
  arrange(a) %>%
  group_by(a)
> attributes(df_grouped)
$class
[1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

$row.names
[1] 1 2 3 4 5

$names
[1] "a" "b"

$vars
[1] "a"

$drop
[1] TRUE

$indices
$indices[[1]]
[1] 0 1

$indices[[2]]
[1] 2 3

$indices[[3]]
[1] 4


$group_sizes
[1] 2 2 1

$biggest_group_size
[1] 2

$labels
  a
1 1
2 2
3 3

So besides other things, there is a new attribute called indices where the group of each element in the grouped variable is referenced. We can actually just change that to make it cumulative.

for (i in seq_along(attributes(df_grouped)[["indices"]])[-1]) {
  attributes(df_grouped)[["indices"]][[i]] <- c(
    attributes(df_grouped)[["indices"]][[i - 1]],
    attributes(df_grouped)[["indices"]][[i]]
  )
}

It looks a bit weird but is straightforward. The elements of each group are added to the next group. E.g. all elements from group 1 are added to group 2.

> attributes(df_grouped)$indices
[[1]]
[1] 0 1

[[2]]
[1] 0 1 3 4

[[3]]
[1] 0 1 3 4 2

We can use the changed groups in the normal dplyr way.

> df_grouped %>%
+   summarise(sum_b = mean(b))
# A tibble: 3 x 2
      a sum_b
  <dbl> <dbl>
1     1   6.5
2     2   8  
3     3   7 

Now of course this is pretty ugly and looks very hacky. But inside a function that doesn't really matter as long as it is still efficient (which it is). So let's make a custom group_by.

group_by_cuml <- function(.data, ...) {
  .data_grouped <- group_by(.data, ...)
  for (i in seq_along(attributes(.data_grouped)[["indices"]])[-1]) {
    attributes(.data_grouped)[["indices"]][[i]] <- c(
      attributes(.data_grouped)[["indices"]][[i - 1]],
      attributes(.data_grouped)[["indices"]][[i]]
    )
  }
  return(.data_grouped)
}

Now you can use the custom function in clean dplyr pipe.

> df %>%
+   group_by_cuml(a) %>% 
+   summarise(sum_b = mean(b))
# A tibble: 3 x 2
      a sum_b
  <dbl> <dbl>
1     1   6.5
2     2   8  
3     3   7  

Upvotes: 1

Sotos
Sotos

Reputation: 51612

One way is to use the base function Reduce with the argument accumulate = TRUE. Once you concatenate, then you can apply any function, i.e.

Reduce(c, split(df$b,df$a), accumulate = TRUE)
#[[1]]
#[1] 6 7

#[[2]]
#[1]  6  7  9 10

#[[3]]
#[1]  6  7  9 10  3

and then for the mean,

sapply(Reduce(c, split(df$b,df$a), accumulate = TRUE), mean)
[1] 6.5 8.0 7.0

Upvotes: 0

Ronak Shah
Ronak Shah

Reputation: 389325

How about something like this?

library(dplyr)

df %>%
 arrange(a) %>%
 group_by(a) %>%
 summarise(sum_b = sum(b)) %>%
 ungroup() %>%
 mutate(sum_b = cumsum(sum_b))

#     a sum_b
#  <dbl> <int>
#1    1.    13
#2    2.    32
#3    3.    40

We take sum by group (a) and then take cumulative sum adding the previous value of the group in the next group.

Upvotes: 3

Related Questions