Reputation: 5821
Imagine we have four symbols - 'a', 'b', 'c', 'd'. We also have four given probabilities of those symbols appearing in the function output - P1, P2, P3, P4 (the sum of which is equal to 1). How would one implement a function which would generate a random sample of three of those symbols, such is that the resulting symbols are present in it with those specified probabilities?
Example: 'a', 'b', 'c' and 'd' have the probabilities of 9/30, 8/30, 7/30 and 6/30 respectively. The function outputs various random samples of any three out of those four symbols: 'abc', 'dca', 'bad' and so on. We run this function many times, counting the amount of times each of the symbols is encountered in its output. At the end, the value of counts stored for 'a' divided by the total amount of symbols output should converge to 9/30, for 'b' to 8/30, for 'c' to 7/30, and for 'd' to 6/30.
E.g. the function generates 10 outputs:
adc
dab
bca
dab
dba
cab
dcb
acd
cab
abc
which out of 30 symbols contains 9 of 'a', 8 of 'b', 7 of 'c' and 6 of 'd'. This is an idealistic example, of course, as the values would only converge when the number of samples is much larger - but it should hopefully convey the idea.
Obviously, this all is only possible when neither probability is larger than 1/3, since each single sample output would always contain three distinct symbols. It is ok for the function to enter an infinite loop or otherwise behave erratically if it's impossible to satisfy the values provided.
Note: the function should obviously use an RNG, but should otherwise be stateless. Each new invocation should be independent from any of the previous ones, except for the RNG state.
EDIT: Even though the description mentions choosing 3 out of 4 values, ideally the algorithm should be able to cope with any sample size.
Upvotes: 2
Views: 626
Reputation: 590
Assuming that the length of a word is always one less than the number of symbols, the following C# code does the job:
using System;
using System.Collections.Generic;
using System.Linq;
using MathNet.Numerics.Distributions;
namespace RandomSymbols
{
class Program
{
static void Main(string[] args)
{
// Sample case: Four symbols with the following distribution, and 10000 trials
double[] distribution = { 9.0/30, 8.0/30, 7.0/30, 6.0/30 };
int trials = 10000;
// Create an array containing all of the symbols
char[] symbols = Enumerable.Range('a', distribution.Length).Select(s => (char)s).ToArray();
// We're assuming that the word length is always one less than the number of symbols
int wordLength = symbols.Length - 1;
// Calculate the probability of each symbol being excluded from a given word
double[] excludeDistribution = Array.ConvertAll(distribution, p => 1 - p * wordLength);
// Create a random variable using the MathNet.Numerics library
var randVar = new Categorical(excludeDistribution);
var random = new Random();
randVar.RandomSource = random;
// We'll store all of the words in an array
string[] words = new string[trials];
for (int t = 0; t < trials; t++)
{
// Start with a word containing all of the symbols
var word = new List<char>(symbols);
// Remove one of the symbols
word.RemoveAt(randVar.Sample());
// Randomly permute the remainder
for (int i = 0; i < wordLength; i++)
{
int swapIndex = random.Next(wordLength);
char temp = word[swapIndex];
word[swapIndex] = word[i];
word[i] = temp;
}
// Store the word
words[t] = new string(word.ToArray());
}
// Display words
Array.ForEach(words, w => Console.WriteLine(w));
// Display histogram
Array.ForEach(symbols, s => Console.WriteLine("{0}: {1}", s, words.Count(w => w.Contains(s))));
}
}
}
Update: The following is a C implementation of the method that rici outlined. The tricky part is calculating the thresholds that he mentions, which I did with recursion.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
// ****** Change the following for different symbol distributions, word lengths, and number of trials ******
double targetFreqs[] = {10.0/43, 9.0/43, 8.0/43, 7.0/43, 6.0/43, 2.0/43, 1.0/43 };
const int WORDLENGTH = 4;
const int TRIALS = 1000000;
// *********************************************************************************************************
const int SYMBOLCOUNT = sizeof(targetFreqs) / sizeof(double);
double inclusionProbs[SYMBOLCOUNT];
double probLeftToIncludeTable[SYMBOLCOUNT][SYMBOLCOUNT];
// Calculates the probability that there will be n symbols left to be included when we get to the ith symbol.
double probLeftToInclude(int i, int n)
{
if (probLeftToIncludeTable[i][n] == -1)
{
// If this is the first symbol, then the number of symbols left to be included is simply the word length.
if (i == 0)
{
probLeftToIncludeTable[i][n] = (n == WORDLENGTH ? 1.0 : 0.0);
}
else
{
// Calculate the probability based on the previous symbol's probabilities.
// To get to a point where there are n symbols left to be included, either there were n+1 symbols left
// when we were considering that previous symbol and we included it, leaving n,
// or there were n symbols left and we didn't included it, also leaving n.
// We have to take into account that the previous symbol may have been manditorily included.
probLeftToIncludeTable[i][n] = probLeftToInclude(i-1, n+1) * (n == SYMBOLCOUNT-i ? 1.0 : inclusionProbs[i-1])
+ probLeftToInclude(i-1, n) * (n == 0 ? 1.0 : 1 - inclusionProbs[i-1]);
}
}
return probLeftToIncludeTable[i][n];
}
// Calculates the probability that the ith symbol won't *have* to be included or *have* to be excluded.
double probInclusionIsOptional(int i)
{
// The probability that inclusion is optional is equal to 1.0
// minus the probability that none of the remaining symbols can be included
// minus the probability that all of the remaining symbols must be included.
return 1.0 - probLeftToInclude(i, 0) - probLeftToInclude(i, SYMBOLCOUNT - i);
}
// Calculates the probability with which the ith symbol should be included, assuming that
// it doesn't *have* to be included or *have* to be excluded.
double inclusionProb(int i)
{
// The following is derived by simple algebra:
// Unconditional probability = (1.0 * probability that it must be included) + (inclusionProb * probability that inclusion is optional)
// therefore...
// inclusionProb = (Unconditional probability - probability that it must be included) / probability that inclusion is optional
return (targetFreqs[i]*WORDLENGTH - probLeftToInclude(i, SYMBOLCOUNT - i)) / probInclusionIsOptional(i);
}
int main(int argc, char* argv[])
{
srand(time(NULL));
// Initialize inclusion probabilities
for (int i=0; i<SYMBOLCOUNT; i++)
for (int j=0; j<SYMBOLCOUNT; j++)
probLeftToIncludeTable[i][j] = -1.0;
// Calculate inclusion probabilities
for (int i=0; i<SYMBOLCOUNT; i++)
{
inclusionProbs[i] = inclusionProb(i);
}
// Histogram
int histogram[SYMBOLCOUNT];
for (int i=0; i<SYMBOLCOUNT; i++)
{
histogram[i] = 0;
}
// Scratchpad for building our words
char word[WORDLENGTH+1];
word[WORDLENGTH] = '\0';
// Run trials
for (int t=0; t<TRIALS; t++)
{
int included = 0;
// Build the word by including or excluding symbols according to the problem constraints
// and the probabilities in inclusionProbs[].
for (int i=0; i<SYMBOLCOUNT && included<WORDLENGTH; i++)
{
if (SYMBOLCOUNT - i == WORDLENGTH - included // if we have to include this symbol
|| (double)rand()/(double)RAND_MAX < inclusionProbs[i]) // or if we get a lucky roll of the dice
{
word[included++] = 'a' + i;
histogram[i]++;
}
}
// Randomly permute the word
for (int i=0; i<WORDLENGTH; i++)
{
int swapee = rand() % WORDLENGTH;
char temp = word[swapee];
word[swapee] = word[i];
word[i] = temp;
}
// Uncomment the following to show each word
// printf("%s\r\n", word);
}
// Show the histogram
for (int i=0; i<SYMBOLCOUNT; i++)
{
printf("%c: target=%d, actual=%d\r\n", 'a'+i, (int)(targetFreqs[i]*WORDLENGTH*TRIALS), histogram[i]);
}
return 0;
}
Upvotes: 1
Reputation: 1948
Your problem is underdetermined.
If we assign a probability to each string of three letters that we allow, p(abc), p(abd), p(acd) etc xtc we can gernerate a series of equations
eqn1: p(abc) + p(abd) + ... others with a "a" ... = p1
...
...
eqn2: p(abd) + p(acd) + ... others with a "d" ... = p4
This has more unknowns than equations, so many ways of solving it. Once a solution is found, by whatever method you choose (use the simplex algorithm if you are me), sample from the probabilities of each string using the roulette method that @alestanis describes.
from numpy import *
# using cvxopt-1.1.5
from cvxopt import matrix, solvers
###########################
# Functions to do some parts
# function to find all valid outputs
def perms(alphabet, length):
if length == 0:
yield ""
return
for i in range(len(alphabet)):
val1 = alphabet[i]
for val2 in perms(alphabet[:i]+alphabet[i+1:], length-1):
yield val1 + val2
# roulette sampler
def roulette_sampler(values, probs):
# Create cumulative prob distro
probs_cum = [sum(probs[:i+1]) for i in range(n_strings)]
def fun():
r = random.rand()
for p,s in zip(probs_cum, values):
if r < p:
return s
# in case of rounding error
return values[-1]
return fun
############################
# Main Part
# create list of all valid strings
alphabet = "abcd"
string_length = 3
alpha_probs = [string_length*x/30. for x in range(9,5,-1)]
# show probs
for a,p in zip(alphabet, alpha_probs):
print "p("+a+") =",p
# all valid outputs for this particular case
strings = [perm for perm in perms(alphabet, string_length)]
n_strings = len(strings)
# constraints from probabilities p(abc) + p(abd) ... = p(a)
contains = array([[1. if s.find(a) >= 0 else 0. for a in alphabet] for s in strings])
#both = concatenate((contains,wons), axis=1).T # hacky, but whatever
#A = matrix(both)
#b = matrix(alpha_probs + [1.])
A = matrix(contains.T)
b = matrix(alpha_probs)
#also need to constrain to [0,1]
wons = array([[1. for s in strings]])
G = matrix(concatenate((eye(n_strings),wons,-eye(n_strings),-wons)))
h = matrix(concatenate((ones(n_strings+1),zeros(n_strings+1))))
## target matricies for approx KL divergence
# uniform prob over valid outputs
u = 1./len(strings)
P = matrix(eye(n_strings))
q = -0.5*u*matrix(ones(n_strings))
# will minimise p^2 - pq for each p val equally
# Do convex optimisation
sol = solvers.qp(P,q,G,h,A,b)
probs = array(sol['x'])
# Print ouput
for s,p in zip(strings,probs):
print "p("+s+") =",p
checkprobs = [0. for char in alphabet]
for a,i in zip(alphabet, range(len(alphabet))):
for s,p in zip(strings,probs):
if s.find(a) > -1:
checkprobs[i] += p
print "p("+a+") =",checkprobs[i]
print "total =",sum(probs)
# Create the sampling function
rndstring = roulette_sampler(strings, probs)
###################
# Verify
print "sampling..."
test_n = 1000
output = [rndstring() for i in xrange(test_n)]
# find which one it is
sampled_freqs = []
for char in alphabet:
n = 0
for val in output:
if val.find(char) > -1:
n += 1
sampled_freqs += [n]
print "plotting histogram..."
import matplotlib.pyplot as plt
plt.bar(range(0,len(alphabet)),array(sampled_freqs)/float(test_n), width=0.5)
plt.show()
EDIT: Python code
Upvotes: 1
Reputation: 241701
I think this is a pretty interesting problem. I don't know the general solution, but it's easy enough to solve in the case of samples of size n-1 (if there is a solution), since there are exactly n possible samples, each of which corresponds to the absence of one of the elements.
Suppose we are seeking Fa = 9/30, Fb = 8/30, Fc = 7/30, Fd = 6/30 in samples of size 3 from a universe of size 4, as in the OP. We can translate each of those frequencies directly into a frequency of samples by selecting the samples which do not contain the given object. For example, we wish 9/30 of the selected objects to be a
's; we cannot have more than one a
in a sample, and we always have three symbols in a sample; consequently, 9/10 of the samples must contain a
and 1/10 cannot contain a
. But there is only one possible sample which doesn't contain a
: bcd
. So 10% of the samples must be bcd
. Similarly, 20% must be acd
; 30% abd
and 40% abc
. (Or, more generally, Fā = 1 - (n-1)Fa where Fā is the frequency of the (unique) sample which does not include a
)
I can't help thinking that this observation combined with one of the classic ways of generating unique samples can solve the general problem. But I don't have that solution. For what it's worth, the algorithm I'm thinking of is the following:
To select a random sample of size k out of a universe U of n objects:
1) Set needed = k; available = n.
2) For each element in U, select a random number in the range [0, 1).
3) If the random number is less than k/n:
3a) Add the element to the sample.
3b) Decrement needed by 1. If it reaches 0, we're finished.
4) Decrement available, and continue with the next element in U.
So my idea is that it should be possible to manipulate the frequency of element by changing the threshold in step 3, making it somehow a function of the desired frequency of the corresponding element.
Upvotes: 1
Reputation: 21863
To do this you have to use a temporary array storing the cumulated sum of your probabilities.
In your example, probabilities are 9/30, 8/30, 7/30 and 6/30 respectively. You should then have an array:
values = {'a', 'b', 'c', 'd'}
proba = {9/30, 17/30, 24/30, 1}
Then you pick a random number r
in [0, 1]
and do like this:
char chooseRandom() {
int i = 0;
while (r > proba[i])
++i;
return values[i];
}
Upvotes: 0