Reputation: 5821

Random sample of values with the specified resulting probabilities

Imagine we have four symbols - 'a', 'b', 'c', 'd'. We also have four given probabilities of those symbols appearing in the function output - P1, P2, P3, P4 (the sum of which is equal to 1). How would one implement a function which would generate a random sample of three of those symbols, such is that the resulting symbols are present in it with those specified probabilities?

Example: 'a', 'b', 'c' and 'd' have the probabilities of 9/30, 8/30, 7/30 and 6/30 respectively. The function outputs various random samples of any three out of those four symbols: 'abc', 'dca', 'bad' and so on. We run this function many times, counting the amount of times each of the symbols is encountered in its output. At the end, the value of counts stored for 'a' divided by the total amount of symbols output should converge to 9/30, for 'b' to 8/30, for 'c' to 7/30, and for 'd' to 6/30.

E.g. the function generates 10 outputs:

adc
dab
bca
dab
dba
cab
dcb
acd
cab
abc

which out of 30 symbols contains 9 of 'a', 8 of 'b', 7 of 'c' and 6 of 'd'. This is an idealistic example, of course, as the values would only converge when the number of samples is much larger - but it should hopefully convey the idea.

Obviously, this all is only possible when neither probability is larger than 1/3, since each single sample output would always contain three distinct symbols. It is ok for the function to enter an infinite loop or otherwise behave erratically if it's impossible to satisfy the values provided.

Note: the function should obviously use an RNG, but should otherwise be stateless. Each new invocation should be independent from any of the previous ones, except for the RNG state.

EDIT: Even though the description mentions choosing 3 out of 4 values, ideally the algorithm should be able to cope with any sample size.

Upvotes: 2

Answers (4)

Kevin

Reputation: 590

Assuming that the length of a word is always one less than the number of symbols, the following C# code does the job:

using System;
using System.Collections.Generic;
using System.Linq;
using MathNet.Numerics.Distributions;

namespace RandomSymbols
{
    class Program
    {
        static void Main(string[] args)
        {
            // Sample case:  Four symbols with the following distribution, and 10000 trials
            double[] distribution = { 9.0/30, 8.0/30, 7.0/30, 6.0/30 };
            int trials = 10000;

            // Create an array containing all of the symbols
            char[] symbols = Enumerable.Range('a', distribution.Length).Select(s => (char)s).ToArray();

            // We're assuming that the word length is always one less than the number of symbols
            int wordLength = symbols.Length - 1;

            // Calculate the probability of each symbol being excluded from a given word
            double[] excludeDistribution = Array.ConvertAll(distribution, p => 1 - p * wordLength);

            // Create a random variable using the MathNet.Numerics library
            var randVar = new Categorical(excludeDistribution);
            var random = new Random();
            randVar.RandomSource = random;

            // We'll store all of the words in an array
            string[] words = new string[trials];

            for (int t = 0; t < trials; t++)
            {
                // Start with a word containing all of the symbols
                var word = new List<char>(symbols);

                // Remove one of the symbols
                word.RemoveAt(randVar.Sample());

                // Randomly permute the remainder
                for (int i = 0; i < wordLength; i++)
                {
                    int swapIndex = random.Next(wordLength);
                    char temp = word[swapIndex];
                    word[swapIndex] = word[i];
                    word[i] = temp;
                }

                // Store the word
                words[t] = new string(word.ToArray());
            }

            // Display words
            Array.ForEach(words, w => Console.WriteLine(w));

            // Display histogram
            Array.ForEach(symbols, s => Console.WriteLine("{0}: {1}", s, words.Count(w => w.Contains(s))));
        }

    }
}

Update: The following is a C implementation of the method that rici outlined. The tricky part is calculating the thresholds that he mentions, which I did with recursion.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// ****** Change the following for different symbol distributions, word lengths, and number of trials ******
double targetFreqs[] = {10.0/43, 9.0/43, 8.0/43, 7.0/43, 6.0/43, 2.0/43, 1.0/43 };
const int WORDLENGTH = 4;
const int TRIALS = 1000000;
// *********************************************************************************************************

const int SYMBOLCOUNT = sizeof(targetFreqs) / sizeof(double);
double inclusionProbs[SYMBOLCOUNT];

double probLeftToIncludeTable[SYMBOLCOUNT][SYMBOLCOUNT];

// Calculates the probability that there will be n symbols left to be included when we get to the ith symbol.
double probLeftToInclude(int i, int n)
{
    if (probLeftToIncludeTable[i][n] == -1)
    {
        // If this is the first symbol, then the number of symbols left to be included is simply the word length.
        if (i == 0)
        {
            probLeftToIncludeTable[i][n] = (n == WORDLENGTH ? 1.0 : 0.0);
        }
        else
        {
            // Calculate the probability based on the previous symbol's probabilities.
            // To get to a point where there are n symbols left to be included, either there were n+1 symbols left
            // when we were considering that previous symbol and we included it, leaving n,
            // or there were n symbols left and we didn't included it, also leaving n.
            // We have to take into account that the previous symbol may have been manditorily included.
             probLeftToIncludeTable[i][n] = probLeftToInclude(i-1, n+1) * (n == SYMBOLCOUNT-i ? 1.0 : inclusionProbs[i-1])
                + probLeftToInclude(i-1, n) * (n == 0 ? 1.0 : 1 - inclusionProbs[i-1]);
        }
    }
    return probLeftToIncludeTable[i][n];
}

// Calculates the probability that the ith symbol won't *have* to be included or *have* to be excluded.
double probInclusionIsOptional(int i)
{
    // The probability that inclusion is optional is equal to 1.0
    // minus the probability that none of the remaining symbols can be included
    // minus the probability that all of the remaining symbols must be included.
    return 1.0 - probLeftToInclude(i, 0) - probLeftToInclude(i, SYMBOLCOUNT - i);
}

// Calculates the probability with which the ith symbol should be included, assuming that
// it doesn't *have* to be included or *have* to be excluded.
double inclusionProb(int i)
{
    // The following is derived by simple algebra:
    // Unconditional probability = (1.0 * probability that it must be included) + (inclusionProb * probability that inclusion is optional)
    // therefore...
    // inclusionProb = (Unconditional probability - probability that it must be included) / probability that inclusion is optional
    return (targetFreqs[i]*WORDLENGTH - probLeftToInclude(i, SYMBOLCOUNT - i)) / probInclusionIsOptional(i);
}

int main(int argc, char* argv[])
{
    srand(time(NULL));

    // Initialize inclusion probabilities
    for (int i=0; i<SYMBOLCOUNT; i++)
        for (int j=0; j<SYMBOLCOUNT; j++)
            probLeftToIncludeTable[i][j] = -1.0;

    // Calculate inclusion probabilities
    for (int i=0; i<SYMBOLCOUNT; i++)
    {
        inclusionProbs[i] = inclusionProb(i);
    }

    // Histogram
    int histogram[SYMBOLCOUNT];
    for (int i=0; i<SYMBOLCOUNT; i++)
    {
        histogram[i] = 0;
    }

    // Scratchpad for building our words
    char word[WORDLENGTH+1];
    word[WORDLENGTH] = '\0';

    // Run trials
    for (int t=0; t<TRIALS; t++)
    {
        int included = 0;

        // Build the word by including or excluding symbols according to the problem constraints
        // and the probabilities in inclusionProbs[].
        for (int i=0; i<SYMBOLCOUNT && included<WORDLENGTH; i++)
        {
            if (SYMBOLCOUNT - i == WORDLENGTH - included // if we have to include this symbol
                || (double)rand()/(double)RAND_MAX < inclusionProbs[i]) // or if we get a lucky roll of the dice
            {
                word[included++] = 'a' + i;
                histogram[i]++;
            }
        }

        // Randomly permute the word
        for (int i=0; i<WORDLENGTH; i++)
        {
            int swapee = rand() % WORDLENGTH;
            char temp = word[swapee];
            word[swapee] = word[i];
            word[i] = temp;
        }

        // Uncomment the following to show each word
        // printf("%s\r\n", word);
    }

    // Show the histogram
    for (int i=0; i<SYMBOLCOUNT; i++)
    {
        printf("%c: target=%d, actual=%d\r\n", 'a'+i, (int)(targetFreqs[i]*WORDLENGTH*TRIALS), histogram[i]);
    }

    return 0;
}

Upvotes: 1

Lucas

Reputation: 1948

Your problem is underdetermined.

If we assign a probability to each string of three letters that we allow, p(abc), p(abd), p(acd) etc xtc we can gernerate a series of equations

eqn1: p(abc) + p(abd) + ... others with a "a" ... = p1
  ...
  ...
eqn2: p(abd) + p(acd) + ... others with a "d" ... = p4

This has more unknowns than equations, so many ways of solving it. Once a solution is found, by whatever method you choose (use the simplex algorithm if you are me), sample from the probabilities of each string using the roulette method that @alestanis describes.

from numpy import *

# using cvxopt-1.1.5
from cvxopt import matrix, solvers 

###########################
# Functions to do some parts

# function to find all valid outputs
def perms(alphabet, length):
    if length == 0:
        yield ""
        return
    for i in range(len(alphabet)):
        val1 = alphabet[i]
        for val2 in perms(alphabet[:i]+alphabet[i+1:], length-1):
            yield val1 + val2


# roulette sampler
def roulette_sampler(values, probs):
    # Create cumulative prob distro
    probs_cum = [sum(probs[:i+1]) for i in range(n_strings)]
    def fun():
        r = random.rand()
        for p,s in zip(probs_cum, values):
            if r < p:
                return s
        # in case of rounding error
        return values[-1]
    return fun


############################
#    Main Part



# create list of all valid strings

alphabet = "abcd"
string_length = 3
alpha_probs = [string_length*x/30. for x in range(9,5,-1)]

# show probs
for a,p in zip(alphabet, alpha_probs):
    print "p("+a+") =",p




# all valid outputs for this particular case
strings = [perm for perm in perms(alphabet, string_length)]
n_strings = len(strings)

# constraints from probabilities p(abc) + p(abd) ... = p(a)
contains = array([[1. if s.find(a) >= 0 else 0. for a in alphabet] for s in strings])
#both = concatenate((contains,wons), axis=1).T # hacky, but whatever
#A = matrix(both)
#b = matrix(alpha_probs + [1.])
A = matrix(contains.T)
b = matrix(alpha_probs)

#also need to constrain to [0,1]
wons = array([[1. for s in strings]])
G = matrix(concatenate((eye(n_strings),wons,-eye(n_strings),-wons)))
h = matrix(concatenate((ones(n_strings+1),zeros(n_strings+1))))

## target matricies for approx KL divergence
# uniform prob over valid outputs
u = 1./len(strings)
P = matrix(eye(n_strings))
q = -0.5*u*matrix(ones(n_strings))
# will minimise p^2 - pq for each p val equally


# Do convex optimisation
sol = solvers.qp(P,q,G,h,A,b)
probs = array(sol['x'])

# Print ouput
for s,p in zip(strings,probs):
    print "p("+s+") =",p
checkprobs = [0. for char in alphabet]
for a,i in zip(alphabet, range(len(alphabet))):
    for s,p in zip(strings,probs):
        if s.find(a) > -1:
            checkprobs[i] += p
    print "p("+a+") =",checkprobs[i]
print "total =",sum(probs)

# Create the sampling function
rndstring = roulette_sampler(strings, probs)


###################
# Verify

print "sampling..."
test_n = 1000
output = [rndstring() for i in xrange(test_n)]

# find which one it is
sampled_freqs = []
for char in alphabet:
    n = 0
    for val in output:
        if val.find(char) > -1:
            n += 1
    sampled_freqs += [n]

print "plotting histogram..."
import matplotlib.pyplot as plt
plt.bar(range(0,len(alphabet)),array(sampled_freqs)/float(test_n), width=0.5)
plt.show()

EDIT: Python code

Upvotes: 1

rici

Reputation: 241701

I think this is a pretty interesting problem. I don't know the general solution, but it's easy enough to solve in the case of samples of size n-1 (if there is a solution), since there are exactly n possible samples, each of which corresponds to the absence of one of the elements.

Suppose we are seeking F_a = 9/30, F_b = 8/30, F_c = 7/30, F_d = 6/30 in samples of size 3 from a universe of size 4, as in the OP. We can translate each of those frequencies directly into a frequency of samples by selecting the samples which do not contain the given object. For example, we wish 9/30 of the selected objects to be a's; we cannot have more than one a in a sample, and we always have three symbols in a sample; consequently, 9/10 of the samples must contain a and 1/10 cannot contain a. But there is only one possible sample which doesn't contain a: bcd. So 10% of the samples must be bcd. Similarly, 20% must be acd; 30% abd and 40% abc. (Or, more generally, F_ā = 1 - (n-1)F_a where F_ā is the frequency of the (unique) sample which does not include a)

I can't help thinking that this observation combined with one of the classic ways of generating unique samples can solve the general problem. But I don't have that solution. For what it's worth, the algorithm I'm thinking of is the following:

To select a random sample of size k out of a universe U of n objects:
1) Set needed = k; available = n.
2) For each element in U, select a random number in the range [0, 1).
3) If the random number is less than k/n:
     3a) Add the element to the sample.
     3b) Decrement needed by 1. If it reaches 0, we're finished.
4) Decrement available, and continue with the next element in U.

So my idea is that it should be possible to manipulate the frequency of element by changing the threshold in step 3, making it somehow a function of the desired frequency of the corresponding element.

Upvotes: 1

alestanis

Reputation: 21863

To do this you have to use a temporary array storing the cumulated sum of your probabilities.

In your example, probabilities are 9/30, 8/30, 7/30 and 6/30 respectively. You should then have an array:

values = {'a', 'b', 'c', 'd'}
proba = {9/30, 17/30, 24/30, 1}

Then you pick a random number r in [0, 1] and do like this:

char chooseRandom() {
    int i = 0;
    while (r > proba[i])
        ++i; 

    return values[i];
}

Upvotes: 0

Random sample of values with the specified resulting probabilities

Answers (4)

Related Questions