Reputation: 1123
Let's say I have a list of items in which each item can be repeated once or more than once (1 to n). I'm trying to find an algorithm that extracts random items from the list until it's empty, but with the constraint that an item cannot be repeated consecutively more than a fixed number of times (which could be different for each item). I would like the algorithm to have "correct probabilities" (I try to explain that later).
For example, say I have this list:
Item | Count | Max. consecutive
-----+-------+-----------------
A | 2 | 1
B | 4 | 2
Some results could be:
B A B A B B
B B A B A B
But the following would be incorrect, since "B" has 3 consecutive repetitions, when the maximum was 2:
B B B A B A
I've managed to create an algorithm that works, but it has a problem with the probabilities. I will put the code first, and then talk about the problem. It's a C# class; sorry if you don't like the format or the GOTOs, let's be friends ;P
To use it, you instantiate an object providing the number of unique items, use SetIndexData()
to add the information for each item (number of elements and max. repetitions), and call GetNextRandomIndex()
to get the next random item (I have implemented it using int
as if the items were indexes, you would have to create an indexed collection to store the real items' values).
You can use the method CheckIndexMaxConsecutiveCorrectness()
to check if it's possible to have an item with the specified maximum repetitions in a bag with the specified total elements. For example, if a bag has only 1 item with 2 elements, but we say it can only be repeated once, it's obviously impossible, so CheckIndexMaxConsecutiveCorrectness()
would return false
.
The method GetIndexMinMaxGroups()
is used internally by the class to retrieve the minimum and maximum number of groups of consecutive elements for an item (for example, in the sequence B B B A B A
we have 2 groups for "A" and 2 for "B"). Actually, the numbers calculated in this method are not exactly true, since there are some variables that it doesn't take into account, but the values it returns are useful for two things: doing the check from CheckIndexMaxConsecutiveCorrectness()
(if the minimum number of groups is greater than the maximum, then the maximum number of repetitions is not correct), and knowing when it's obligatory to return a determined item from GetNextRandomIndex()
(when the minimum is equal to the maximum). I haven't used any mathematical algorithm to deduce those properties, so something could be wrong, though so far it has "magically" worked perfectly in my tests, millions of times...
class ShuffleBag
{
/*** INFORMATION FOR THE INDEXES ***/
// Class for the data:
class IndexData
{
public int Count = 0; // The current number of instances of the index in the bag.
public int MaxConsecutive = int.MaxValue; // The maximum consecutive repetitions allowed for the index.
}
// List of indexes data for this bag:
IndexData[] IndexesDataList;
/*** MEMBERS USED FOR CALCULATIONS ***/
// Random number generator:
Random RandGenerator;
// Remaining elements in the bag:
int _RemainingElementsCount = 0;
public int RemainingElementsCount
{
get { return _RemainingElementsCount; }
}
// Last retrieved index (-1 if no index has been retrieved yet),
// and the last consecutive repetitions of that index:
int LastIndex = -1;
int LastRepetitions = 0;
///==============///
/// Constructor. ///
///==============///
public ShuffleBag(int uniqueIndexesCount)
{
IndexesDataList = new IndexData[uniqueIndexesCount];
for (int i = 0; i < uniqueIndexesCount; i++)
IndexesDataList[i] = new IndexData();
RandGenerator = new Random();
}
///===========================================================///
/// Resets the shuffle bag; must be called before reusing it. ///
/// The number of unique indexes won't be reset. ///
/// Doesn't need to be called just after creating the bag. ///
///===========================================================///
public void Reset()
{
for (int i = 0; i < IndexesDataList.Length; i++)
{
IndexesDataList[i].Count = 0;
IndexesDataList[i].MaxConsecutive = int.MaxValue;
}
_RemainingElementsCount = 0;
LastIndex = -1;
LastRepetitions = 0;
}
///==================================================================================================///
/// Checks if it's possible to honor the max repetitions of an index with the provided data. ///
/// If it was not possible, the behaviour of a shuffle bag with those parameters would be undefined. ///
///==================================================================================================///
public static bool CheckIndexMaxConsecutiveCorrectness(int maxConsecutive, int indexElements, int bagTotalElements)
{
int min, max;
GetIndexMinMaxGroups(indexElements, maxConsecutive, bagTotalElements, out min, out max);
return min <= max;
}
///=====================================================================///
/// Sets the data for the specified index. ///
/// Can be called after starting to use the bag, ///
/// but if any parameters make the max consecutive repetitions invalid, ///
/// the behaviour of the bag will be undefined. ///
///=====================================================================///
public void SetIndexData(int index, int count, int maxConsecutive)
{
IndexData data = IndexesDataList[index];
_RemainingElementsCount += count - data.Count;
data.Count = count;
data.MaxConsecutive = maxConsecutive;
}
///====================================================================================================================///
/// Retrieves the next random index. The caller must check if there are remaining elements in the bag to be retrieved. ///
///====================================================================================================================///
public int GetNextRandomIndex()
{
/*** GET THE INDEX ***/
int index;
// If, for any index, the minimum possible groups equals the maximum, it must be the returned index:
for (index = 0; index < IndexesDataList.Length; index++)
{
IndexData data = IndexesDataList[index];
int minGroups, maxGroups;
GetIndexMinMaxGroups(data.Count, data.MaxConsecutive, _RemainingElementsCount, out minGroups, out maxGroups);
if (minGroups == maxGroups)
goto _INDEX_FOUND_;
}
// Get a random number to choose the index:
int rand = RandGenerator.Next(_RemainingElementsCount);
for (index = 0; index < IndexesDataList.Length; index++)
{
IndexData data = IndexesDataList[index];
// This index corresponds with the random number:
if (rand < data.Count)
{
// Check if the index has reached the maximum consecutive repetitions;
// in that case, get the next available one:
if (index == LastIndex && data.MaxConsecutive == LastRepetitions)
{
for (int k = 1; k <= IndexesDataList.Length - 1; k++)
{
int m = WrapIndexSimple(index + k, IndexesDataList.Length);
if (IndexesDataList[m].Count > 0)
{
index = m;
goto _INDEX_FOUND_;
}
}
}
goto _INDEX_FOUND_;
}
// This index doesn't correspond with the random number; update it to check the next index:
else
{
rand -= data.Count;
}
}
/*** INDEX FOUND, UPDATE AND RETURN ***/
_INDEX_FOUND_:
IndexData resultData = IndexesDataList[index];
resultData.Count--;
_RemainingElementsCount--;
if (LastIndex == index)
{
LastRepetitions++;
}
else
{
LastIndex = index;
LastRepetitions = 1;
}
return index;
}
///===============================================================================///
/// Calculates the minimum and maximum possible groups of consecutive repetitions ///
/// for an index with the specified data. ///
/// If any provided data is invalid, the behaviour and results are undefined. ///
///===============================================================================///
static void GetIndexMinMaxGroups(int indexRemainingElements, int indexMaxConsecutive, int bagRemainingElements, out int min, out int max)
{
int rem;
int div = Math.DivRem(indexRemainingElements, indexMaxConsecutive, out rem);
min = rem == 0 ? div : div + 1;
max = bagRemainingElements - indexRemainingElements + 1;
}
///=======================================================///
/// Converts an index out of bounds to a valid value. ///
/// "length" is the number of elements of the collection. ///
/// Only works for indexes that are less than 2*length. ///
///=======================================================///
static int WrapIndexSimple(int index, int length)
{
if (index >= length)
return index - length;
else if (index < 0)
return length + index;
else
return index;
}
}
As I said, so far the class has worked as I wanted, extracting items without ever exceeding the number of repetitions. The problem is that the probabilities of extracting an item at each call of GetNextRandomIndex()
are not what they should be. For example, if I have this:
Item | Count | Max. consecutive
-----+-------+-----------------
a | 30 | 3
X | 10 | 2
If I'm right, there would be 13 different possible sequences. The method should return "a" 12 times for each time it returns "X" in the first 3 calls; for the fourth, it should return "a" 10 times for every 3 times it returns "X"; etc. These are the different sequences:
Seq. 1 | Seq. 2 | Seq. 3 | Seq. 4 | Seq. 5 | Seq. 6 | Seq. 7 | Seq. 8 | Seq. 9 | Seq. 10 | Seq. 11 | Seq. 12 | Seq. 13 | Number of “a” | Number of “X”
-------+--------+--------+--------+--------+--------+--------+--------+--------+---------+---------+---------+---------+---------------+--------------
a | a | a | X | a | a | a | a | a | a | a | a | a | 12 | 1
a | a | X | a | a | a | a | a | a | a | a | a | a | 12 | 1
a | X | a | a | a | a | a | a | a | a | a | a | a | 12 | 1
X | a | a | a | X | X | X | X | X | X | X | X | X | 3 | 10
a | a | a | X | X | a | a | a | a | a | a | a | a | 11 | 2
a | a | X | a | a | a | a | a | a | a | a | a | a | 12 | 1
a | X | a | a | a | a | a | a | a | a | a | a | a | 12 | 1
X | a | a | a | a | X | X | X | X | X | X | X | X | 4 | 9
a | a | a | X | X | X | a | a | a | a | a | a | a | 10 | 3
a | a | X | a | a | a | a | a | a | a | a | a | a | 12 | 1
a | X | a | a | a | a | a | a | a | a | a | a | a | 12 | 1
X | a | a | a | a | a | X | X | X | X | X | X | X | 5 | 8
a | a | a | X | X | X | X | a | a | a | a | a | a | 9 | 4
a | a | X | a | a | a | a | a | a | a | a | a | a | 12 | 1
a | X | a | a | a | a | a | a | a | a | a | a | a | 12 | 1
X | a | a | a | a | a | a | X | X | X | X | X | X | 6 | 7
a | a | a | X | X | X | X | X | a | a | a | a | a | 8 | 5
a | a | X | a | a | a | a | a | a | a | a | a | a | 12 | 1
a | X | a | a | a | a | a | a | a | a | a | a | a | 12 | 1
X | a | a | a | a | a | a | a | X | X | X | X | X | 7 | 6
a | a | a | X | X | X | X | X | X | a | a | a | a | 7 | 6
a | a | X | a | a | a | a | a | a | a | a | a | a | 12 | 1
a | X | a | a | a | a | a | a | a | a | a | a | a | 12 | 1
X | a | a | a | a | a | a | a | a | X | X | X | X | 8 | 5
a | a | a | X | X | X | X | X | X | X | a | a | a | 6 | 7
a | a | X | a | a | a | a | a | a | a | a | a | a | 12 | 1
a | X | a | a | a | a | a | a | a | a | a | a | a | 12 | 1
X | a | a | a | a | a | a | a | a | a | X | X | X | 9 | 4
a | a | a | X | X | X | X | X | X | X | X | a | a | 5 | 8
a | a | X | a | a | a | a | a | a | a | a | a | a | 12 | 1
a | X | a | a | a | a | a | a | a | a | a | a | a | 12 | 1
X | a | a | a | a | a | a | a | a | a | a | X | X | 10 | 3
a | a | a | X | X | X | X | X | X | X | X | X | a | 4 | 9
a | a | X | a | a | a | a | a | a | a | a | a | a | 12 | 1
a | X | a | a | a | a | a | a | a | a | a | a | a | 12 | 1
X | a | a | a | a | a | a | a | a | a | a | a | X | 11 | 2
a | a | a | X | X | X | X | X | X | X | X | X | X | 3 | 10
a | a | X | a | a | a | a | a | a | a | a | a | a | 12 | 1
a | X | a | a | a | a | a | a | a | a | a | a | a | 12 | 1
X | a | a | a | a | a | a | a | a | a | a | a | a | 12 | 1
This little console application shows the results of the use of the class. For the specified number of iterations, it fills the bag and extracts all the random items. At the end, it shows the maximum number of consecutive repetitions for each item, and the accumulated number of occurrences of each item and each call to GetNextRandomIndex()
for all the iterations. As you can see, after 1 million iterations, the accumulated elements for the first call are around 750,000 for "a" and 250,000 for "X", and for the last call are around 999,950 for "a" and 50 for "X":
static void Main(string[] args)
{
int MaxIterations = 1000000;
bool ShowResults = true;
string[] items = new string[] { "a", "X" };
int[] count = new int[] { 30, 10 };
int[] maxRepet = new int[] { 3, 2 };
int elementCount = count.Sum();
bool maxRepetOK = true;
for (int i = 0; i < items.Length; i++)
maxRepetOK = maxRepetOK && ShuffleBag.CheckIndexMaxConsecutiveCorrectness(maxRepet[i], count[i], elementCount);
if (! maxRepetOK)
{
Console.WriteLine("*** Bad number of repetitions!! ***\n");
goto _END_;
}
Dictionary<string, Tuple<int,int>> results = new Dictionary<string,Tuple<int,int>>();
for (int i = 0; i < items.Length; i++)
results.Add(items[i], new Tuple<int,int>(0, 0));
int[,] resultsPerCall = new int[elementCount, items.Length];
ShuffleBag bag = new ShuffleBag(items.Length);
int iterations = 0;
for (int x = 0; x < MaxIterations; x++)
{
iterations++;
bag.Reset();
for (int i = 0; i < items.Length; i++)
bag.SetIndexData(i, count[i], maxRepet[i]);
string prevItem = "";
int prevRepetitions = 0;
int row = 0;
while (bag.RemainingElementsCount > 0)
{
int newIndex = bag.GetNextRandomIndex();
if (prevItem == items[newIndex])
{
prevRepetitions++;
}
else
{
prevItem = items[newIndex];
prevRepetitions = 1;
}
var resultAnt = results[items[newIndex]];
results[items[newIndex]] = new Tuple<int,int>(resultAnt.Item1 + 1, Math.Max(prevRepetitions, resultAnt.Item2));
resultsPerCall[row, newIndex]++;
if (ShowResults)
{
Console.WriteLine(items[newIndex]);
}
row++;
}
if (ShowResults && MaxIterations > 1)
{
Console.WriteLine("\nESC:\tEnd\nENTER:\tNext iteration\nTAB:\tAll iterations");
while (true)
{
switch (Console.ReadKey(true).Key)
{
case ConsoleKey.Enter:
goto _CONTINUE_;
case ConsoleKey.Escape:
goto _RESULTS_;
case ConsoleKey.Tab:
ShowResults = false;
goto _CONTINUE_;
}
}
_CONTINUE_:
Console.Clear();
Console.WriteLine("Iterating ...");
}
}
_RESULTS_:
Console.Clear();
Console.WriteLine("RESULTS\n------------------------------------------");
for (int i = 0; i < items.Length; i++)
{
var data = results[items[i]];
double average = (double)data.Item1 / iterations;
Console.WriteLine(items[i] +
": Average = " + average.ToString() + (average != count[i] ? "(!)" : "") +
"\t\tMax. repetitions = " + data.Item2.ToString() + (data.Item2 > maxRepet[i] ? "(!)" : ""));
}
Console.WriteLine();
for (int i = 0; i < elementCount; i++)
{
Console.Write((i+1).ToString("00") + ")");
for (int k = 0; k < items.Length; k++)
{
Console.Write(" \t" + items[k] + ": " + resultsPerCall[i, k].ToString());
}
Console.WriteLine();
}
_END_:
Console.WriteLine("\n\nESC to exit");
while (Console.ReadKey(true).Key != ConsoleKey.Escape);
}
The problem is that, from GetNextRandomIndex()
, I only use the number of remaining elements of an item as the probability for selecting it, so for the first call the probability of getting "a" is 3 times the probability of getting "X" (since there are 30 elements of "a" and 10 of "X"). Does anybody have any idea of how can I change my algorithm (or use a different one) to get the correct probabilities?
Upvotes: 2
Views: 229
Reputation: 65498
Here's an approach for exact sampling that did not work as well as I had hoped. I'm posting it in case it proves useful anyway. Given the maximum run lengths, prepare an automaton that recognizes the valid strings.
def make_automaton(max_consecutive):
states = {letter * j for letter, k in max_consecutive.items() for j in range(1, k + 1)}
states.add('')
automaton = {}
for state in states:
transitions = {}
for letter in max_consecutive.keys():
new_state = state + letter if letter == state[-1:] else letter
if new_state in states:
transitions[letter] = new_state
automaton[state] = transitions
return automaton
Here's the code in action.
>>> from pprint import pprint
>>> pprint(make_automaton({'a': 3, 'X': 2}))
{'': {'X': 'X', 'a': 'a'},
'X': {'X': 'XX', 'a': 'a'},
'XX': {'a': 'a'},
'a': {'X': 'X', 'a': 'aa'},
'aa': {'X': 'X', 'a': 'aaa'},
'aaa': {'X': 'X'}}
Now, there's an approach to exact sampling that seems to use too much space. Prepare a table indexed by pairs of an automaton state and a multiset of letters. The table values give the number of words with specified letter counts that put the automaton in the specified state. Then we can use this table by sampling the final state and using conditional probabilities to derive the string that got the automaton there.
My idea was, instead of trying to remember the counts, sample from the following distribution until we get the letter counts right. Each letter is distributed independently according to the counts. We condition on the resulting word being valid.
Here's the rest of the code. It seems to work OK as long as the counts aren't too large and the runs aren't too short.
from collections import defaultdict
def make_probabilities(count, automaton):
probabilities = [{min(automaton): 1.0}]
total = sum(count.values())
for i in range(1, total + 1):
distribution = defaultdict(float)
for state, p in probabilities[-1].items():
transitions = automaton[state]
for letter, k in count.items():
if letter in transitions:
distribution[transitions[letter]] += p * (k / total)
probabilities.append(dict(distribution))
return probabilities
pprint(make_probabilities({'a': 30, 'X': 10}, make_automaton({'a': 3, 'X': 2})))
from random import random
def weighted_sample(distribution):
while True:
sample = random() * sum(distribution.values())
for letter, k in distribution.items():
sample -= k
if sample < 0.0:
return letter
from collections import Counter
print(Counter(weighted_sample({'a': 30, 'X': 10}) for i in range(10000)))
def unbiased_sample(count, max_consecutive):
automaton = make_automaton(max_consecutive)
probabilities = make_probabilities(count, automaton)
total = sum(count.values())
while True:
sample = []
state = weighted_sample(probabilities[-1])
for i in range(len(probabilities) - 2, -1, -1):
conditional_distribution = {}
for old_state, p in probabilities[i].items():
transitions = automaton[old_state]
for letter, k in count.items():
if letter in transitions and transitions[letter] == state:
conditional_distribution[(old_state, letter)] = p * (k / total)
state, letter = weighted_sample(conditional_distribution)
sample.append(letter)
if Counter(sample) == count:
return ''.join(sample)
for r in range(1000):
print(unbiased_sample({'A': 10, 'B': 10, 'C': 10, 'D': 10}, {'A': 5, 'B': 4, 'C': 3, 'D': 2}))
Upvotes: 1
Reputation: 65498
As promised, here’s a Markov chain Monte Carlo sampler. I can’t seem to
get an exact version using the Propp–Wilson technique; this one merely
converges to a distribution that is biased toward and uniform on the
desired outcomes, possibly rather slowly. Define the score of a
permutation to be the number of invalid letters. Specifically, if A
should appear at most once consecutively and B
should appear at most
twice consecutively, then the score of
AAABBAABBBB
^^ ^ ^^
is 5
, where the contributing letters are marked with ^
.
Starting with a random permutation, repeatedly choose two positions with replacement independently and uniformly at random. Score the permutation with the letters at those positions swapped; this is the proposed permutation. Compute two to the power of (the current score - the proposed score). If a uniform random float between zero and one is less than this number, then keep this proposed swap; otherwise, undo it. Do this as long as you can stand, then take the next valid permutation.
Upvotes: 1