Wasabi
Wasabi

Reputation: 3071

Create groups with a minimum number of elements

Let's say we have the following dataframe, which is the result of a previous dplyr::sumarise() call:

df <- data.frame(x = c(1, 2, 3, 4, 5),
                 n = c(8, 7, 3, 3, 1))

Here df$x is a value which we later want to group over and df$n states how many items in the original table had the given df$x.

However, there are currently too many groups of df$x, such that the higher values have very low $n's, so I want to collapse groups into "supergroups", such that each contains a total $n of at least 4.

I've found many similar questions on SO (this being the most cited), but none quite like this.

The best solution I've found isn't very elegant, but works:

suppressPackageStartupMessages(library(dplyr))

mySplit <- function(x, n) {
  a <- seq(n)

  count <- 0
  j <- 1
  for(i in rev(x)) {
    a[i] <- j
    count <- count + n[i]
    if (count >= 4) {
      j <- j + 1
      count <- 0
    }
  }

  return(a)
}

df <- data.frame(x = c(1, 2, 3, 4, 5),
                 n = c(8, 7, 3, 3, 1))

df %>%
  mutate(
    g = mySplit(x, n)
  ) %>%
  group_by(
    desc(g)
  ) %>%
  summarise(
    x = x[1]
    , n = sum(n)
  )
#> # A tibble: 3 x 3
#>   `desc(g)`     x     n
#>       <dbl> <dbl> <dbl>
#> 1        -3     1     8
#> 2        -2     2    10
#> 3        -1     4     4

Created on 2020-01-13 by the reprex package (v0.3.0)

From that output, we can see that it merged $x %in% (2, 3) and $x %in% (4, 5), such that no group has less than 4 items.

However, this isn't a very good solution for many reasons:

So, is there a better way to do this I'm unaware of?

This is basically a "rolling sum" function which resets the sum whenever a value is crossed and instead returns the number of "previous resets" (though an optimal solution as stated above would need something much more complex). However, I can't figure out how to actually code that.

Upvotes: 1

Views: 191

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 389175

We can use cumsumbinning from MESS. For this case, you need cutwhenpassed=TRUE which is not available on CRAN yet but you can install it from github.

devtools::install_github("ekstroem/MESS")

cumsumbinning basically creates groups until cumulative sum passes a threshold.

MESS::cumsumbinning(df$n, 4, cutwhenpassed = TRUE)
#[1] 1 2 3 3 4

So here, we can use it as :

library(dplyr)
threshold <- 4

df %>%
  group_by(g = MESS::cumsumbinning(n, threshold, cutwhenpassed = TRUE)) %>%
  summarise(x = first(x), n = sum(n))

# A tibble: 4 x 3
#      g     x     n
#  <int> <dbl> <dbl>
#1     1     1     8
#2     2     2     7
#3     3     3     6
#4     4     5     1

Upvotes: 1

Related Questions