Create groups with a minimum number of elements

Question

Let's say we have the following dataframe, which is the result of a previous dplyr::sumarise() call:

df <- data.frame(x = c(1, 2, 3, 4, 5),
                 n = c(8, 7, 3, 3, 1))

Here df$x is a value which we later want to group over and df$n states how many items in the original table had the given df$x.

However, there are currently too many groups of df$x, such that the higher values have very low $n's, so I want to collapse groups into "supergroups", such that each contains a total $n of at least 4.

I've found many similar questions on SO (this being the most cited), but none quite like this.

The best solution I've found isn't very elegant, but works:

suppressPackageStartupMessages(library(dplyr))

mySplit <- function(x, n) {
  a <- seq(n)

  count <- 0
  j <- 1
  for(i in rev(x)) {
    a[i] <- j
    count <- count + n[i]
    if (count >= 4) {
      j <- j + 1
      count <- 0
    }
  }

  return(a)
}

df <- data.frame(x = c(1, 2, 3, 4, 5),
                 n = c(8, 7, 3, 3, 1))

df %>%
  mutate(
    g = mySplit(x, n)
  ) %>%
  group_by(
    desc(g)
  ) %>%
  summarise(
    x = x[1]
    , n = sum(n)
  )
#> # A tibble: 3 x 3
#>   `desc(g)`     x     n
#>         
#> 1        -3     1     8
#> 2        -2     2    10
#> 3        -1     4     4

^{Created on 2020-01-13 by the reprex package (v0.3.0)}

From that output, we can see that it merged $x %in% (2, 3) and $x %in% (4, 5), such that no group has less than 4 items.

However, this isn't a very good solution for many reasons:

As already stated, it's inelegant
It's greedy. This is acceptable since I've made it go from back to front; the latter values of $x will almost always have the lower $ns, so the risk of having poor grouping is reduced.
It's non-optimal. The best split would arguably have been to leave groups $x = 1, 2, (3, 4, 5), which would have created supergroups of almost equal size (8, 7 and 7).

So, is there a better way to do this I'm unaware of?

This is basically a "rolling sum" function which resets the sum whenever a value is crossed and instead returns the number of "previous resets" (though an optimal solution as stated above would need something much more complex). However, I can't figure out how to actually code that.

Create groups with a minimum number of elements

Answers (1)

Related Questions