Reputation: 3071
Let's say we have the following dataframe, which is the result of a previous dplyr::sumarise()
call:
df <- data.frame(x = c(1, 2, 3, 4, 5),
n = c(8, 7, 3, 3, 1))
Here df$x
is a value which we later want to group over and df$n
states how many items in the original table had the given df$x
.
However, there are currently too many groups of df$x
, such that the higher values have very low $n
's, so I want to collapse groups into "supergroups", such that each contains a total $n
of at least 4.
I've found many similar questions on SO (this being the most cited), but none quite like this.
The best solution I've found isn't very elegant, but works:
suppressPackageStartupMessages(library(dplyr))
mySplit <- function(x, n) {
a <- seq(n)
count <- 0
j <- 1
for(i in rev(x)) {
a[i] <- j
count <- count + n[i]
if (count >= 4) {
j <- j + 1
count <- 0
}
}
return(a)
}
df <- data.frame(x = c(1, 2, 3, 4, 5),
n = c(8, 7, 3, 3, 1))
df %>%
mutate(
g = mySplit(x, n)
) %>%
group_by(
desc(g)
) %>%
summarise(
x = x[1]
, n = sum(n)
)
#> # A tibble: 3 x 3
#> `desc(g)` x n
#> <dbl> <dbl> <dbl>
#> 1 -3 1 8
#> 2 -2 2 10
#> 3 -1 4 4
Created on 2020-01-13 by the reprex package (v0.3.0)
From that output, we can see that it merged $x %in% (2, 3)
and $x %in% (4, 5)
, such that no group has less than 4 items.
However, this isn't a very good solution for many reasons:
$x
will almost always have the lower $n
s, so the risk of having poor grouping is reduced.$x = 1, 2, (3, 4, 5)
, which would have created supergroups of almost equal size (8, 7 and 7).So, is there a better way to do this I'm unaware of?
This is basically a "rolling sum" function which resets the sum whenever a value is crossed and instead returns the number of "previous resets" (though an optimal solution as stated above would need something much more complex). However, I can't figure out how to actually code that.
Upvotes: 1
Views: 191
Reputation: 389175
We can use cumsumbinning
from MESS
. For this case, you need cutwhenpassed=TRUE
which is not available on CRAN yet but you can install it from github.
devtools::install_github("ekstroem/MESS")
cumsumbinning
basically creates groups until cumulative sum passes a threshold.
MESS::cumsumbinning(df$n, 4, cutwhenpassed = TRUE)
#[1] 1 2 3 3 4
So here, we can use it as :
library(dplyr)
threshold <- 4
df %>%
group_by(g = MESS::cumsumbinning(n, threshold, cutwhenpassed = TRUE)) %>%
summarise(x = first(x), n = sum(n))
# A tibble: 4 x 3
# g x n
# <int> <dbl> <dbl>
#1 1 1 8
#2 2 2 7
#3 3 3 6
#4 4 5 1
Upvotes: 1