Reputation: 157
Say I have a set of numbers and I want to split them into cohorts whose sums follow a predetermined distribution. As a simple example, if the numbers sum to 100 and the distribution is 0.2, 0.3, 0.5 for cohorts 1, 2 and 3 respectively, then I'd want to find a subset of numbers whose sum is 20, another disjoint subset whose sum is 30, and a final disjoint subset whose sum is 50. Obviously it doesn't have to be exact; it just has to match the distribution reasonably closely.
I have a way in VBA in which I can use the Solver add-in to find a subset of the numbers whose sum is close to (within 3,000, say) a cohort's share of the predetermined distribution. This involves a SUMPRODUCT of the list of numbers with a column of 0/1 decision variables (constrained to be binary), and minimising the difference between the total amount required for that cohort and that SUMPRODUCT.
Once this has been done, any number selected for that subset is removed and the Solver step is repeated on the shrunken list. I have attached an image of the procedure's evolution that I hope is clear; the colours correspond to the iterations. In the first (green) iteration we have the full list, and the variables that change are the 0/1 values in the corresponding green column, to get a SUMPRODUCT close to 142,449.09.
Note the sum of the full list is: 1,424,490.85 in this example.
The "Difference" line is the Solver objective, and after each iteration the objective shifts one column to the right. (I have set it so that if the difference is within 1,000 it displays zero, as this appeared to speed up the method.) The simulated value is calculated from the corresponding coloured SUMPRODUCT, and the theoretical value is simply the probability multiplied by the total sum of all the numbers.
I have attached the code below, but in reality this method isn't time-effective, especially as I have to do this across multiple datasets - which is the reality of the problem. I'd like to move this project into a more efficient language like R (which I've had some experience with, albeit not to a high degree), as I believe it could do this process more quickly and effectively.
I am also aware my algorithm has flaws: some of the later cohorts won't be as accurate because they are fitted against a smaller remaining dataset. It also seems to include zeros in the SUMPRODUCT, which I don't want it to do (see the grey column). Finally, I want all the numbers to be used, but sometimes a number is omitted because its inclusion would move the result further from the theoretical distribution. I am not sure how to fix the above, so I'd appreciate some advice on this front.
Has anyone done such a thing in R?
I also understand this could be a question for Cross Validated - I really wasn't sure, so feel free to move it. I have attached the code and the tables in text form below.
Thanks in advance,
Sub solversimple()
    Dim wb As Workbook
    Dim ws As Worksheet
    Dim rCell As String, rChange As String, rSum As String
    Dim i As Integer

    Set wb = Application.ThisWorkbook
    Set ws = wb.Sheets("Output")

    For i = 1 To 5
        ' Objective cell, 0/1 decision column and shrunken list for this iteration
        rCell = ws.Range("q8").Offset(0, i - 1).Address
        rChange = ws.Range("h4:h36").Offset(0, i - 1).Address
        rSum = ws.Range("I5:I39").Offset(0, i - 1).Address

        SolverReset
        SolverOk SetCell:=rCell, MaxMinVal:=2, ValueOf:=0, ByChange:=rChange, _
            Engine:=3, EngineDesc:="Evolutionary"
        SolverAdd CellRef:=rChange, Relation:=5, FormulaText:="binary"
        SolverAdd CellRef:=rSum, Relation:=5, FormulaText:="binary"
        SolverSolve True
    Next i
End Sub
Full List List after 1st It List after 2nd List after 3rd List after 4th 1 2 3 4 5
49000.21 49000.21 49000.21 49000.21 49000.21 0.00 0.00 0.00 0.00 1.00
51591.99 51591.99 51591.99 51591.99 51591.99 0.00 0.00 0.00 0.00 1.00
18390.18 18390.18 0.00 0.00 0.00 0.00 1.00 1.00 0.00 1.00
45490.39 45490.39 45490.39 45490.39 45490.39 0.00 0.00 0.00 0.00 1.00
37506.41 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00
1460.11 1460.11 1460.11 0.00 0.00 0.00 0.00 1.00 1.00 0.00
136564.86 136564.86 136564.86 136564.86 0.00 0.00 0.00 0.00 1.00 1.00
41581.29 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 0.00
6138.26 6138.26 6138.26 0.00 0.00 0.00 0.00 1.00 0.00 0.00
23831.37 23831.37 23831.37 23831.37 0.00 0.00 0.00 0.00 1.00 1.00
4529.44 4529.44 0.00 0.00 0.00 0.00 1.00 1.00 1.00 1.00
1291.53 1291.53 1291.53 0.00 0.00 0.00 0.00 1.00 0.00 0.00
1084.88 1084.88 1084.88 0.00 0.00 0.00 0.00 1.00 0.00 0.00
33516.76 33516.76 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00
43393.83 43393.83 0.00 0.00 0.00 0.00 1.00 1.00 0.00 0.00
81000.69 81000.69 81000.69 0.00 0.00 0.00 0.00 1.00 0.00 0.00
25397.64 25397.64 0.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00
29473.54 29473.54 29473.54 0.00 0.00 0.00 0.00 1.00 1.00 1.00
39097.70 0.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00 1.00
59669.99 59669.99 0.00 0.00 0.00 0.00 1.00 1.00 1.00 1.00
18639.97 18639.97 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00
97198.13 97198.13 97198.13 0.00 0.00 0.00 0.00 1.00 0.00 1.00
5558.69 5558.69 0.00 0.00 0.00 0.00 1.00 1.00 1.00 0.00
16298.63 16298.63 0.00 0.00 0.00 0.00 1.00 0.00 1.00 1.00
67621.61 67621.61 67621.61 0.00 0.00 0.00 0.00 1.00 0.00 0.00
69388.09 69388.09 0.00 0.00 0.00 0.00 1.00 1.00 1.00 0.00
193524.89 193524.89 193524.89 193524.89 0.00 0.00 0.00 0.00 1.00 1.00
12455.61 0.00 0.00 0.00 0.00 1.00 1.00 0.00 0.00 1.00
7261.88 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00
77879.68 77879.68 0.00 0.00 0.00 0.00 1.00 1.00 0.00 1.00
53891.97 53891.97 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00
70602.68 70602.68 70602.68 70602.68 70602.68 0.00 0.00 0.00 0.00 1.00
4157.96 0.00 0.00 0.00 0.00 1.00 1.00 1.00 0.00 1.00
Cohort 1 2 3 4 5
Probability 0.10 0.30 0.20 0.25 0.15
Theoretical 142449.09 427347.26 284898.17 356122.71 213673.63
Simulated 142060.85 426554.86 285268.75 353921.12 216685.28
Difference 0.00 0.00 0.00 2201.59 3011.65
Upvotes: 2
Views: 192
Reputation: 269481
Using R, create some test input. Then, using a greedy approach, determine the ordering indices o. Use findInterval to determine the breakpoints b, and then create a grouping vector and rearrange it to correspond to the original order of x, so that x[i] is in group g[i]. Note that split(x, g) creates a list of length(d) groups.
# test input
set.seed(123)
x <- sample(20, 20)
d <- c(.2, .3, .5) # target distribution; assume in increasing order

o <- order(x)                                        # indices that sort x ascending
b <- findInterval(cumsum(d) * sum(x), cumsum(x[o]))  # breakpoints along the sorted cumulative sums
g <- rep(seq_along(d), diff(c(0, b)))[order(o)]      # group labels, restored to the original order

# check distribution of result
tapply(x, g, sum) / sum(x)
## 1 2 3
## 0.1714286 0.3285714 0.5000000
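For the question's 5-cohort case the same idea can be wrapped in a small function. This is only a sketch: assign_cohorts is a made-up name, the data are random stand-ins for the question's 33 amounts, and the last breakpoint is pinned to length(x) so that every number gets assigned even if floating-point drift in cumsum(d) leaves the final target fractionally below sum(x) - which also addresses the requirement that all numbers be used.

```r
# Sketch: the greedy assignment above as a reusable function, applied to
# the question's 5-cohort probabilities. Data are random stand-ins.
assign_cohorts <- function(x, d) {
  o <- order(x)                                       # sort values ascending
  b <- findInterval(cumsum(d) * sum(x), cumsum(x[o])) # greedy breakpoints
  b[length(b)] <- length(x)                           # pin last breakpoint: use every number
  rep(seq_along(d), diff(c(0, b)))[order(o)]          # labels back in original order
}

set.seed(42)
x <- round(runif(33, 1000, 200000), 2)   # stand-in for the 33 amounts
d <- c(0.10, 0.30, 0.20, 0.25, 0.15)     # probabilities from the question
g <- assign_cohorts(x, d)
round(tapply(x, g, sum) / sum(x), 3)     # achieved shares, to compare with d
```

Because each cohort is a contiguous block of the sorted values, every value is used exactly once, and the error in any cumulative target is at most the largest single value in x.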
Upvotes: 2