rak953

Reputation: 29

Creating buckets based on a target variable

DSA   3M Contribution   6M Contribution
ABC   1.3%              3.6%
BCA   3.4%              1.5%
CDE   0.8%              0.1%
DCA   0.1%              2.0%
EDA   2.0%              0.7%

I have a dataset with 3 variables, where the column DSA has up to 180 levels. The 3M and 6M contribution columns each add up to 100%. I want a random grouping within DSA where the sum of the 3M contribution comes to 10% in each group. The count of DSAs in a group doesn't matter as long as the group totals 10%; for example, 20 DSAs can make up 10% in one group and 15 DSAs can make up 10% in another. The key is randomness in the distribution across groups. I'll replicate the same solution for the 6M contribution. I'd appreciate a solution with minimal hard coding.

One thing I forgot to add: I need a solution that breaks my dataset into 10 bins/groups of roughly 10% contribution each, with each group distinct from the others. It doesn't have to be exactly 10% per group; a tolerance of ±0.5% is acceptable.

Upvotes: 0

Views: 288

Answers (1)

Allan Cameron

Reputation: 173813

It's not clear how close to 10% you need each group to sum. Here's a solution that should give acceptable results, unless you need the values to add up to exactly 10, in which case you may be very limited in the sets you can "randomly" choose, as per Konrad Rudolph's comments.

We can replicate your data structure like this:

set.seed(3)

df <- data.frame(DSA = apply(replicate(3, sample(LETTERS, 180, TRUE)), 1, 
                             paste, collapse = ""),
                 x = rexp(180),
                 y = rexp(180))

df$x <- 100 * df$x/sum(df$x)
df$y <- 100 * df$y/sum(df$y)

names(df) <- c("DSA", "3M Contribution", "6M Contribution")

So now we have:

head(df)
#>   DSA 3M Contribution 6M Contribution
#> 1 EFS       0.6825126       1.2142943
#> 2 ZDL       0.0751734       0.1383280
#> 3 LWK       1.6487536       0.6453974
#> 4 GBY       2.1218616       0.2870302
#> 5 DDU       0.3414601       2.6318194
#> 6 ZHT       0.2341914       0.4497800

There are 180 rows, each with a unique 3-letter identifier for DSA, and the 3M Contribution and 6M Contribution columns each add to 100:

length(unique(df$DSA))
#> [1] 180

sum(df$`3M Contribution`)
#> [1] 100

sum(df$`6M Contribution`)
#> [1] 100

Now that we have the data, we can select a random ordering of the data frame by doing:

sample_df <- df[sample(nrow(df)),]

All this does is put the data frame rows in an arbitrary order. However, we can now keep only the first n rows whose cumulative 3M contribution stays (approximately) under 10, like this:

sample_df <- sample_df[cumsum(sample_df$`3M Contribution`) < 10,]

sample_df
#>     DSA 3M Contribution 6M Contribution
#> 59  QQQ      0.85167563      0.02718667
#> 43  WJY      0.35287431      1.40795918
#> 165 ISF      0.47397401      0.32810467
#> 114 GVV      0.35121926      0.56594478
#> 42  EHT      1.44584224      0.33704927
#> 71  DJX      0.84489716      0.16338381
#> 2   ZDL      0.07517340      0.13832804
#> 110 PZM      0.98701927      0.07642120
#> 11  JME      0.22959383      2.31503532
#> 153 BER      0.09994387      0.61655857
#> 81  UXU      0.22244869      0.31506248
#> 166 QYV      0.86343465      1.62441563
#> 171 SPH      2.43818827      0.84201677
#> 131 MAB      0.64740835      1.06533067

And we can see that we have a completely random selection of 14 DSAs that add up to (almost) 10:

sum(sample_df$`3M Contribution`)
#> [1] 9.883693

And if we run it again, we get a completely different set:

sample_df <- df[sample(nrow(df)),]
sample_df <- sample_df[cumsum(sample_df$`3M Contribution`) < 10,]

sample_df
#>     DSA 3M Contribution 6M Contribution
#> 165 ISF      0.47397401      0.32810467
#> 2   ZDL      0.07517340      0.13832804
#> 129 XAN      0.68953299      0.23862892
#> 37  OFV      0.32093896      0.54626344
#> 56  ZSU      0.85837341      0.15694326
#> 42  EHT      1.44584224      0.33704927
#> 115 YEB      0.05722397      0.81590015
#> 89  SQV      0.34496853      0.32799710
#> 67  DBT      0.82052602      0.47833734
#> 124 ZOI      0.01553935      0.49215515
#> 125 GLS      1.87641137      0.70018877
#> 91  AOW      0.55159848      0.09659654
#> 38  JPH      0.05702578      0.02310634
#> 62  RGO      0.57398460      0.20540972
#> 139 LOA      0.34511367      0.38109741
#> 154 NWH      1.04575037      1.08159278
#> 114 GVV      0.35121926      0.56594478
#> 47  DIL      0.07265622      0.18332102

On this occasion, we have a set of 18 randomly chosen DSAs that again add up to almost 10:

sum(sample_df$`3M Contribution`)
#> [1] 9.975853
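Because the cutoff is `cumsum(...) < 10`, each draw is guaranteed to sum to just under 10, but not necessarily within the ±0.5% tolerance mentioned in the question. One way to enforce that is to wrap the shuffle-and-cumsum step in a small retry loop. This is just a sketch; `draw_group` and its arguments are illustrative names, not functions from any package:

```r
# Draw one random group whose `3M Contribution` sums to target - tol or more,
# re-shuffling until the tolerance is met (illustrative helper, not a package function)
draw_group <- function(df, target = 10, tol = 0.5, max_tries = 1000) {
  for (i in seq_len(max_tries)) {
    shuffled <- df[sample(nrow(df)), ]
    grp <- shuffled[cumsum(shuffled$`3M Contribution`) < target, ]
    if (abs(sum(grp$`3M Contribution`) - target) <= tol) return(grp)
  }
  stop("no group within tolerance after ", max_tries, " shuffles")
}
```

Applied to the simulated `df` above, `draw_group(df)` returns one random set of DSAs whose 3M contributions total somewhere between 9.5 and 10, since the cumsum cutoff only ever undershoots the target.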

And if you want to get 10 such samples, you can get them all in a list, like so:

all_samples <- lapply(1:10, function(x) {
  sample_df <- df[sample(nrow(df)),]
  sample_df[cumsum(sample_df$`3M Contribution`) < 10,]
})

Now we have a list of 10 samples called all_samples, and we can get each just by using subscripts. For example, the fifth such sample would be obtained like this:

all_samples[[5]]
#>     DSA 3M Contribution 6M Contribution
#> 147 YQZ      0.85928604      0.07089867
#> 84  WUD      0.15243070      0.09202777
#> 103 WSQ      0.27323608      1.79443012
#> 166 QYV      0.86343465      1.62441563
#> 1   EFS      0.68251264      1.21429430
#> 52  YTJ      0.61530262      0.47584261
#> 98  DVY      0.61661083      0.46831908
#> 3   LWK      1.64875363      0.64539744
#> 8   KDV      0.05386368      0.31977784
#> 41  CQE      0.85948052      0.14494440
#> 99  YSU      0.69153090      0.96764012
#> 72  LZZ      0.77170284      0.29414941
#> 170 BRA      0.15377281      0.54707772
#> 95  OAF      0.68996185      0.73363205
#> 69  ICC      0.71197929      0.16501976
#> 82  BMR      0.12875373      0.07509324
#> 169 HWR      0.01473804      0.71425244
#> 124 ZOI      0.01553935      0.49215515
#> 157 SGN      0.06565888      0.29731670

And we can see that the sum is still close to 10%:

sum(all_samples[[5]]$`3M Contribution`)
#> [1] 9.868549
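Note that the ten samples above are drawn independently, so the same DSA can appear in more than one of them. If, as the question asks, the ten groups must be mutually exclusive and together cover the whole dataset, you can draw each group from the rows not yet assigned. Again a sketch, with `partition_groups` as an illustrative name; because each of the first nine groups sums to just under 10, the final group absorbs the leftover and will sum to slightly more than 10:

```r
# Split all rows into n_groups disjoint groups, each (except the last)
# with `3M Contribution` summing to just under `target`
partition_groups <- function(df, n_groups = 10, target = 10) {
  df <- df[sample(nrow(df)), ]           # one random ordering drives the whole split
  groups <- vector("list", n_groups)
  for (g in seq_len(n_groups - 1)) {
    take <- cumsum(df$`3M Contribution`) < target
    groups[[g]] <- df[take, ]
    df <- df[!take, ]                    # drop assigned rows before the next cut
  }
  groups[[n_groups]] <- df               # remaining rows form the last group
  groups
}
```

Since every row is assigned exactly once, `do.call(rbind, partition_groups(df))` contains each DSA exactly once, and the ten group sums add back up to 100.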

Upvotes: 1
