Reputation: 157
Say I have a set of numbers and I want to split them into cohorts whose sums follow a predetermined distribution. As a simple example, if the numbers sum to 100 and the distribution is 0.2, 0.3, 0.5 for cohorts 1, 2 and 3 respectively, then I'd want to find a subset of numbers whose sum is 20, another disjoint subset whose sum is 30, and a final disjoint subset whose sum is 50. Obviously it doesn't have to be exact; it just has to match the distribution reasonably closely.
I have a way in VBA in which I can use the Solver add-in to find a subset of the numbers whose sum is close to (within 3,000, say) a cohort's share of the predetermined distribution. This involves a SUMPRODUCT of the list of numbers with a column of 0/1 decision variables (constrained to be binary), and minimising the difference between the total amount required for that cohort and that SUMPRODUCT.
Once this has been done, any number selected for that subset is removed and the Solver step is repeated on the shrunken list. I have attached an image of the procedure's evolution that I hope is clear; the colours correspond to the iterations. In the first (green) iteration we have the full list, and the variables that change are the 0/1 values in the corresponding green column, to get a SUMPRODUCT close to 142,449.09.
Note the sum of the full list is: 1,424,490.85 in this example.
The "Difference" line is the Solver objective, and after each iteration the objective shifts one column to the right. (I have set it so that if the difference is within 1,000 it displays zero, as this appeared to speed up the method.) The simulated value is calculated from the corresponding coloured SUMPRODUCT, and the theoretical value is simply the probability multiplied by the total sum of all the numbers.
I have attached the code below, but in reality this method isn't time-effective, especially as I have to do this across multiple datasets - which is the reality of the problem. I'd like to move this project into a more efficient language like R (which I've had some experience with, albeit not to a high degree), as I believe it could do this process more quickly and effectively.
I am also aware my algorithm has flaws: some of the later cohorts won't be as accurate because they are fitted against a smaller remaining dataset. It also seems to include zeros in the SUMPRODUCT, which I don't want it to do (see the grey column). Finally, I want all the numbers to be used, but sometimes a number is omitted because its inclusion would move the result further from the theoretical distribution. I am not sure how to fix the above, so I'd appreciate some advice on this front.
Has anyone done such a thing in R?
I also understand this could be a question for Cross Validated - I really wasn't sure, so feel free to move it. I have attached the code and the tables in text form below.
Thanks in advance,
Sub solversimple()
    Dim wb As Workbook
    Dim ws As Worksheet
    Dim rCell As String, rChange As String, rSum As String
    Dim i As Integer

    Set wb = Application.ThisWorkbook
    Set ws = wb.Sheets("Output")

    For i = 1 To 5
        ' Objective cell, 0/1 decision column and shrunken list for this iteration
        rCell = ws.Range("q8").Offset(0, i - 1).Address
        rChange = ws.Range("h4:h36").Offset(0, i - 1).Address
        rSum = ws.Range("I5:I39").Offset(0, i - 1).Address

        SolverReset
        SolverOk SetCell:=rCell, MaxMinVal:=2, ValueOf:=0, ByChange:=rChange, _
            Engine:=3, EngineDesc:="Evolutionary"
        SolverAdd CellRef:=rChange, Relation:=5, FormulaText:="binary"
        SolverAdd CellRef:=rSum, Relation:=5, FormulaText:="binary"
        SolverSolve True
    Next i
End Sub
Full List List after 1st It List after 2nd List after 3rd List after 4th 1 2 3 4 5
49000.21 49000.21 49000.21 49000.21 49000.21 0.00 0.00 0.00 0.00 1.00
51591.99 51591.99 51591.99 51591.99 51591.99 0.00 0.00 0.00 0.00 1.00
18390.18 18390.18 0.00 0.00 0.00 0.00 1.00 1.00 0.00 1.00
45490.39 45490.39 45490.39 45490.39 45490.39 0.00 0.00 0.00 0.00 1.00
37506.41 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00
1460.11 1460.11 1460.11 0.00 0.00 0.00 0.00 1.00 1.00 0.00
136564.86 136564.86 136564.86 136564.86 0.00 0.00 0.00 0.00 1.00 1.00
41581.29 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 0.00
6138.26 6138.26 6138.26 0.00 0.00 0.00 0.00 1.00 0.00 0.00
23831.37 23831.37 23831.37 23831.37 0.00 0.00 0.00 0.00 1.00 1.00
4529.44 4529.44 0.00 0.00 0.00 0.00 1.00 1.00 1.00 1.00
1291.53 1291.53 1291.53 0.00 0.00 0.00 0.00 1.00 0.00 0.00
1084.88 1084.88 1084.88 0.00 0.00 0.00 0.00 1.00 0.00 0.00
33516.76 33516.76 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00
43393.83 43393.83 0.00 0.00 0.00 0.00 1.00 1.00 0.00 0.00
81000.69 81000.69 81000.69 0.00 0.00 0.00 0.00 1.00 0.00 0.00
25397.64 25397.64 0.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00
29473.54 29473.54 29473.54 0.00 0.00 0.00 0.00 1.00 1.00 1.00
39097.70 0.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00 1.00
59669.99 59669.99 0.00 0.00 0.00 0.00 1.00 1.00 1.00 1.00
18639.97 18639.97 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00
97198.13 97198.13 97198.13 0.00 0.00 0.00 0.00 1.00 0.00 1.00
5558.69 5558.69 0.00 0.00 0.00 0.00 1.00 1.00 1.00 0.00
16298.63 16298.63 0.00 0.00 0.00 0.00 1.00 0.00 1.00 1.00
67621.61 67621.61 67621.61 0.00 0.00 0.00 0.00 1.00 0.00 0.00
69388.09 69388.09 0.00 0.00 0.00 0.00 1.00 1.00 1.00 0.00
193524.89 193524.89 193524.89 193524.89 0.00 0.00 0.00 0.00 1.00 1.00
12455.61 0.00 0.00 0.00 0.00 1.00 1.00 0.00 0.00 1.00
7261.88 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00
77879.68 77879.68 0.00 0.00 0.00 0.00 1.00 1.00 0.00 1.00
53891.97 53891.97 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00
70602.68 70602.68 70602.68 70602.68 70602.68 0.00 0.00 0.00 0.00 1.00
4157.96 0.00 0.00 0.00 0.00 1.00 1.00 1.00 0.00 1.00
Cohort 1 2 3 4 5
Probability 0.10 0.30 0.20 0.25 0.15
Theoretical 142449.09 427347.26 284898.17 356122.71 213673.63
Simulated 142060.85 426554.86 285268.75 353921.12 216685.28
Difference 0.00 0.00 0.00 2201.59 3011.65
Upvotes: 2
Views: 192
Reputation: 269481
Using R, create some test input. Then, using a greedy approach, determine the ordering indices o. Use findInterval to determine the breakpoints b, and then create a grouping vector and rearrange it to correspond to the original order of x, so that x[i] is in group g[i]. Note that split(x, g) creates a list of length(d) groups.
# test input
set.seed(123)
x <- sample(20, 20)
d <- c(.2, .3, .5) # target distribution; assume in increasing order

o <- order(x)                                        # indices that sort x ascending
b <- findInterval(cumsum(d) * sum(x), cumsum(x[o]))  # breakpoints along the sorted cumulative sums
g <- rep(seq_along(d), diff(c(0, b)))[order(o)]      # group labels, restored to the original order

# check distribution of result
tapply(x, g, sum) / sum(x)
## 1 2 3
## 0.1714286 0.3285714 0.5000000
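For the question's 5-cohort case the same idea can be wrapped in a small function. This is only a sketch: assign_cohorts is a made-up name, the data are random stand-ins for the question's 33 amounts, and the last breakpoint is pinned to length(x) so that every number gets assigned even if floating-point drift in cumsum(d) leaves the final target fractionally below sum(x) - which also addresses the requirement that all numbers be used.

```r
# Sketch: the greedy assignment above as a reusable function, applied to
# the question's 5-cohort probabilities. Data are random stand-ins.
assign_cohorts <- function(x, d) {
  o <- order(x)                                       # sort values ascending
  b <- findInterval(cumsum(d) * sum(x), cumsum(x[o])) # greedy breakpoints
  b[length(b)] <- length(x)                           # pin last breakpoint: use every number
  rep(seq_along(d), diff(c(0, b)))[order(o)]          # labels back in original order
}

set.seed(42)
x <- round(runif(33, 1000, 200000), 2)   # stand-in for the 33 amounts
d <- c(0.10, 0.30, 0.20, 0.25, 0.15)     # probabilities from the question
g <- assign_cohorts(x, d)
round(tapply(x, g, sum) / sum(x), 3)     # achieved shares, to compare with d
```

Because each cohort is a contiguous block of the sorted values, every value is used exactly once, and the error in any cumulative target is at most the largest single value in x.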
Upvotes: 2