Reputation: 957
I stumbled upon a basic discrete math/probability question and I wanted to get some ideas for improvements over my solution.
Assume you are given a collection (an alphabet, the natural numbers, etc.). How do you ensure that you draw a certain value X
from this collection with a given probability P
?
I'll explain my naïve solution with an example:
Collection = {A, B}
X = A, P = 1/4
We build an array v = [A, B, B, B]
and we use a rand
function to uniformly sample the indices of the array, i.e., {0, 1, 2, 3}
This approach works, but isn't efficient: the smaller P
, the bigger the memory storage of v
. Hence, I was wondering what ideas the stackoverflow community might have in improving this.
Thanks!
Upvotes: 1
Views: 222
Reputation: 7394
Your proposed solution is indeed not great, and the most general and efficient way to solve it is as mathematician1975 states (this is known as the inverse CDF method). For your specific problem, which is multinomial sampling, you can also use a series of draws from binomial distributions to sample from your collection. This is often more intuitive if you're not familiar with sampling methods.
If the first item in the collection has probability p_1, sample uniformly in the interval [0-1]. If the sample is less than p_1, return item 1. Otherwise, renormalise the remaining outcomes by 1-p_1 and repeat the process with the next possible outcome. After each unsuccessful sampling, renormalise remaining outcomes by the total probability of rejected outcomes, so that the sum of remaining outcomes is 1. If you get to the last outcome, return it with probability 1. The result of the process will be random samples distributed according to your original vector.
This method is using the fact that individual components of a multinomial are binomially distributed, and any sub vector of the multinomial is also multinomial with parameters given by the renormalisation I describe above.
Upvotes: 1
Reputation: 21351
Partition the interval [0,1] into disjoint intervals whose union is [0,1]. Create the size of each partition to correspond to the probability of selecting each event. Then simply sample randomly from [0,1], evaluate which of your partitions the result lies in, then look up the selection that corresponds to that interval. In your example, this would result in the following 2 intervals [0,1/4) and [1/4,1] - generate a random uniform value from [0,1]. If your sample lies in the first interval then your selection X = A
, if in the other interval then X = B
.
Upvotes: 5