Reputation: 10750
Imagine that you have n items and m bins. All items are identical, however the bins are distinct. What's the fastest algorithm to randomly select an allotment of items into bins?
For example - imagine 104
is a placement of 5 items into 3 bins. There are 21 possible placements of 5 items into 3 bins:
005 014 023
032 041 050
104 113 122
131 140 203
212 221 230
302 311 320
401 410 500
For large numbers of items and bins, how should I select a placement of n
items into m
bins such that each possible placement has an equal chance of occurring? For a given n
and m
selection will be done a large number of times.
Upvotes: 3
Views: 1078
Reputation: 22564
Here are two algorithms for two situations.
You have much memory available and you will do many placements. In this case, you can use your memory to pre-calculate and store all possible placements of your n
items into m
bins. If we let C(n, r)
be the number of combinations of r
items taken from n
items, without repetition and without regard to order, then the number of possible placements is C(m+n-1, m-1)
. (That formula is pretty standard in combinatorics and uses the stars and bars method). In your example, that is
C(5+3-1, 3-1) = C(7, 2) = 7!/2!/(7-2)! = 7/1*6/2 = 21
(If MathJax were available in StackOverflow I could make that look much more pretty with standard math notation.) In the setup to your program, those placements can be generated with a small recursive routine with this idea--place k
items (0 <= k <= n
) into the first bin, then place the remaining n-k
items into the remaining m-1
bins. Then each time you want a random placement, choose a random number between 1 and C(m+n-1, m-1)
and use that as an index to choose the placement. The time cost per additional placement is just one random number calculation and one array access. You can't get much better than that.
You have little memory available and you will do one or few placements. Then you can choose your random placement with an iterative routine that calculates multiple combination coefficients.
First, calculate the number of possible placement of your n
items into m
bins, C(m+n-1, m-1)
, and choose a random number r
from 1 to that combinations number. Let k
be the number of items to place in the first bin. There will then be n-k
remaining items in m-1
bins, which has a count of C(m+n-k-2, m-2)
. Now cycle k
starting at 0
. If that count for k=0
exceeds or equals r
, we decide to set k=0
items in the first bin. If not, increase k
by one and decrease r
by that combinations count, and find a new combinations count for the new k
. If that count exceeds or equals r
, we decide to set k
items in the first bin. If not, increase k
by one and ... you get the idea. When we have chosen a particular value of k
we replace n
with n-k
, m
with m-1
, leave r
as it now is, and move to the next bin.
The count for this is n
iterations through the items and m
iterations through the bins, for m+n
iterations and calculations of combination coefficients. The only memory usage is a few simple variables and the final placement into m
bins. You want a good combination coefficients calculator routine.
(ADDED LATER: I have fully coded this routine and found a better way to calculate the probabilities involved without finding the numbers of combinations. This reduces the calculation time and fully achieves the order m+n
for the routine. This could be reduced to order m
if I could find a good function to directly find a value that gives a certain probability, but I am unable to find such a function. I could find approximations, if you are willing to have a near-uniform distribution of placements rather than a fully uniform distribution.)
Let me know if you want some Python code demonstrating either routine, but clarify which situation you are in and show some more of your own effort first.
Upvotes: 2
Reputation: 12324
Five identical items can be distributed over three distinct bins in 35 = 243 ways, each leading to one of these 21 distributions:
500 410 320 311 221
050 401 302 131 212
005 140 230 113 122
104 203
041 032
014 023
You'll notice that there are two mechanisms at work here: firstly the number 5 can be partitioned into a maximum of 3 parts in 5 different ways (the columns), and secondly each such partition has a number of permutations (the rows).
To enumerate the partitions with restricted number of parts, use a recursive algorithm, e.g. with 5 items and 3 bins:
5 items in 1 bin
4 items in 1 bin + recurse with 1 item in 2 bins
3 items in 1 bin + recurse with 2 items in 2 bins
2 items in 1 bin + recurse with 3 items in 2 bins
Generate only non-rising sequences like [2,2,1], not [2,1,2] or [0,2,3], by never putting a number of items in the current bin that is smaller than the number of items divided by the number of bins (that's why there's no option with 1 item in the first bin in the example above).
(Partitioning can be speeded up by using memoization.)
To calculate the probability of each partition (i.e. the number of permutations), divide the factorial of the number of bins by the product of the factorials of the number of bins with a certain number of items:
5,0,0 3! / (1! x 2!) = 3
4,1,0 3! / (1! x 1! x 1!) = 6
3,2,0 3! / (1! x 1! x 1!) = 6
3,1,1 3! / (1! x 2!) = 3
2,2,1 3! / (2! x 1!) = 3
--
21
Then, choose a random number from 1 to 21, and select the corresponding partition and its permutation; e.g. choosing 13 would mean partition [3,2,0] and its fourth permutation [2,0,3].
So instead of enumerating all (243 in the 5:3 example) options or all (21) distributions, we enumerate the (5) partitions, and maybe the (up to 6) permutations if there's no faster way to get to the n-th unique permutation. For large numbers, this should make a huge difference.
(For details and code examples for some of these steps, see this answer to a related question.)
Upvotes: 0