Reputation: 131
I want to take a df as the one below and want to cut/bin/group/sample into groups of size=20. Ideally, this "binning" occurs randomly across IDs rather then consecutively from top row to bottom row).
E.g. IDs 2, 29 and 71 have counts of 7,7,6 and would fit nicely into a "bin" of size=20. I want to achieve the minimum number of bins and do not care about order of IDs (the more random they are, the better).
set.seed(123)
df <- data.frame(
ID = as.numeric(1:100),
Count = as.numeric(sample(1:8, size = 100, replace = T)))
Desired outcome would be a dataframe/tibble looking something like the below with optimum random sampling and minimising bin number.
Bin_size=20 is the parameter set by me (the ideal outcome is exact 20 (=20) but <20 is ok, however, >20 is not ok). Each Bin should be given a number (e.g. if I have 10 bins, I would like them to be called Bin_number 1-10).
ID, Count, Bin_size, Bin_number
ID 2, 7, 20, 1
ID 29, 7, 20, 1
ID 71, 6, 20, 1
etc.
Where 7+7+6 = 20 (etc.)
Any help with this would be much appreciated. I have been wondering about cumsum and group_by but could not figure it out.
if you need more details, I'm happy to provide them. thanks!
Upvotes: 1
Views: 376
Reputation: 34621
The BBmisc
package has a simple (though not optimized) bin packing algorithm that might be useful:
library(BBmisc)
library(dplyr)
df %>%
as_tibble() %>%
mutate(bin = binPack(Count, 20),
bin_size = ave(Count, bin, FUN = sum)) %>%
arrange(bin)
# A tibble: 100 x 4
ID Count bin bin_size
<dbl> <dbl> <int> <dbl>
1 11 4 1 20
2 17 8 1 20
3 27 8 1 20
4 22 4 2 20
5 42 8 2 20
6 56 8 2 20
7 34 4 3 20
8 62 8 3 20
9 79 8 3 20
10 40 4 4 20
# ... with 90 more rows
Upvotes: 1
Reputation: 499
Do you want something like this,
df<-df%>%arrange(Count)%>%mutate(Sum=cumsum(Count),Bin_size=20)
df<-df%>%mutate(Bin_number=cut(Sum,breaks = seq(0,c(max(Sum)+20),20),labels = F,right = T))
Upvotes: 0