Moe
Moe

Reputation: 131

R sampling into groups of specific size based on count data

I want to take a df as the one below and want to cut/bin/group/sample into groups of size=20. Ideally, this "binning" occurs randomly across IDs rather then consecutively from top row to bottom row).

E.g. IDs 2, 29 and 71 have counts of 7,7,6 and would fit nicely into a "bin" of size=20. I want to achieve the minimum number of bins and do not care about order of IDs (the more random they are, the better).

set.seed(123)
df <- data.frame(
  ID = as.numeric(1:100),
  Count = as.numeric(sample(1:8, size = 100, replace = T)))

Desired outcome would be a dataframe/tibble looking something like the below with optimum random sampling and minimising bin number.

Bin_size=20 is the parameter set by me (the ideal outcome is exact 20 (=20) but <20 is ok, however, >20 is not ok). Each Bin should be given a number (e.g. if I have 10 bins, I would like them to be called Bin_number 1-10).

ID, Count, Bin_size, Bin_number

ID 2, 7, 20, 1

ID 29, 7, 20, 1

ID 71, 6, 20, 1

etc.

Where 7+7+6 = 20 (etc.)

Any help with this would be much appreciated. I have been wondering about cumsum and group_by but could not figure it out.

if you need more details, I'm happy to provide them. thanks!

Upvotes: 1

Views: 376

Answers (2)

lroha
lroha

Reputation: 34621

The BBmisc package has a simple (though not optimized) bin packing algorithm that might be useful:

library(BBmisc)
library(dplyr)

df %>%
  as_tibble() %>%
  mutate(bin = binPack(Count, 20),
         bin_size = ave(Count, bin, FUN = sum)) %>%
  arrange(bin)

# A tibble: 100 x 4
      ID Count   bin bin_size
   <dbl> <dbl> <int>    <dbl>
 1    11     4     1       20
 2    17     8     1       20
 3    27     8     1       20
 4    22     4     2       20
 5    42     8     2       20
 6    56     8     2       20
 7    34     4     3       20
 8    62     8     3       20
 9    79     8     3       20
10    40     4     4       20
# ... with 90 more rows

Upvotes: 1

ssaha
ssaha

Reputation: 499

Do you want something like this,

df<-df%>%arrange(Count)%>%mutate(Sum=cumsum(Count),Bin_size=20)
df<-df%>%mutate(Bin_number=cut(Sum,breaks = seq(0,c(max(Sum)+20),20),labels = F,right = T))

Upvotes: 0

Related Questions