MadOgre

Reputation: 541

Shuffling two lists together with no repetitions in the second list

I have 2 vectors of strings (one is roughly 1/3 the size of the other). I am trying to implement an algorithm that will randomly shuffle the two vectors together. In the resulting vector items that were previously in vector A can follow each other, but the ones that were in vector B cannot.

For example if every element in vector A is "FOO" and every element in vector B is "BAR", then the resulting vector might be {"FOO","FOO","BAR","FOO","BAR","FOO","FOO","BAR"}

As you can see "FOO" may repeat but "BAR" must not

This is roughly what I have so far:

#include <string>
#include <chrono>
#include <algorithm>
#include <random>
#include <vector>

std::vector<std::string> a(1000, "FOO");
std::vector<std::string> b(300, "BAR");
std::vector<std::string> result;

bool resultIsValid();

void mergeVectors()
{
    unsigned seed = std::chrono::system_clock::now().time_since_epoch().count();
    std::mt19937 generator(seed);

    result = a;
    result.insert(result.end(), b.begin(), b.end());
    while (!resultIsValid())
    {
        std::shuffle(result.begin(), result.end(), generator);
    }
}

bool resultIsValid()
{
    // compare every adjacent pair, including the last one
    for (std::size_t i = 0; i + 1 < result.size(); ++i)
        if (result[i] == "BAR" && result[i+1] == "BAR")
            return false;
    return true;
}

This is not the actual code, but it should give the idea. When I run this, the program goes into an endless loop because the actual number of strings is much higher (in the 10000 range) and the shuffle never produces a valid vector: there is always at least one "BAR" directly following another. Would anyone be able to suggest a better alternative than repeatedly rechecking the created vector for adjacent "BAR"s? Am I making this more complicated than it has to be?

Upvotes: 2

Views: 368

Answers (5)

Simon

Reputation: 10841

We can compute the probability that the next element in the resulting vector should be from A or B based on the number of elements remaining to be allocated from each of A and B and randomly select the next element from either A or B based on that probability.

The probability of selecting the next element from list A is always 100% if the last element selected was from list B (which prevents consecutive elements from B).

If the number of elements left in A is equal to the number of elements left in B, the probability that one should get the next element from B is 100% if the last element added to the resulting list was from A.

Otherwise, the probability of selecting an element from A should be equal to (A.length() - B.length())/A.length(), assuming that we're taking the elements out of A and B as we put them in the resulting vector, so their lengths decrease to zero over the process. We could determine if the element should come from list A or list B by testing a randomly generated value between 0 and 1 against this probability.

This should guarantee that the As and Bs are evenly shuffled, which one could test by running the program many times and comparing the number of Bs in each half of the resulting vector.

(EDIT)

I thought it was quicker to test this algorithm in Python, so here's my Python implementation:

from random import random

def intersperse(a,b):
    la = len(a)
    lb = len(b)
    ia = 0
    ib = 0
    bWasLast = False
    res = []
    while (ia < la) or (ib < lb):
        if bWasLast:
            res.append(a[ia])
            ia += 1
            bWasLast = False
        elif ((lb - ib) > (la - ia)) and not bWasLast:
            res.append(b[ib])
            ib += 1
            bWasLast = True
        else:
            laRemaining = la - ia
            lbRemaining = lb - ib
            probA = (laRemaining - lbRemaining)/laRemaining
            if random() < probA:
                res.append(a[ia])
                ia += 1
                bWasLast = False
            else:
                res.append(b[ib])
                ib += 1
                bWasLast = True
    return res

Test code is as follows:

A = 'a'*10000
B = 'b'*3000

sumBLeft = 0
sumBRight = 0

for n in range(100):
    r = intersperse(A,B)
    sumBLeft += sum([1 for x in r[:len(r)//2] if x == 'b'])
    sumBRight += sum([1 for x in r[len(r)//2:] if x == 'b'])

print (sumBLeft/sumBRight)

The output from this program showed that the number of 'b's in the left side of the result was only 0.08% greater than the number of 'b's in the right side of the result for my test run, confirming that the distribution of 'b's through the result vector is even.
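
Since the question is about C++, here is a rough port of the Python above. It is only a sketch: the function name intersperse is carried over from the Python, while passing the std::mt19937 generator as a parameter and throwing when B has more than |A|+1 elements are my own assumptions, not part of the tested code.

```cpp
#include <cstddef>
#include <random>
#include <stdexcept>
#include <string>
#include <vector>

// Interleave a and b so that no two elements of b end up adjacent.
// Assumption: b.size() <= a.size() + 1, otherwise no valid result exists.
std::vector<std::string> intersperse(const std::vector<std::string>& a,
                                     const std::vector<std::string>& b,
                                     std::mt19937& gen)
{
    if (b.size() > a.size() + 1)
        throw std::invalid_argument("too many elements in b");

    std::vector<std::string> res;
    res.reserve(a.size() + b.size());
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    std::size_t ia = 0, ib = 0;
    bool bWasLast = false;

    while (ia < a.size() || ib < b.size()) {
        const std::size_t aLeft = a.size() - ia;
        const std::size_t bLeft = b.size() - ib;
        bool takeA;
        if (bWasLast)
            takeA = true;                  // never two elements of b in a row
        else if (bLeft > aLeft)
            takeA = false;                 // b has to go first or it will not fit
        else                               // probability (aLeft - bLeft) / aLeft
            takeA = unit(gen) < static_cast<double>(aLeft - bLeft) / aLeft;

        if (takeA) res.push_back(a[ia++]);
        else       res.push_back(b[ib++]);
        bWasLast = !takeA;
    }
    return res;
}
```

As in the Python version, the elements of a and b keep their relative order; shuffle each vector first if that matters.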

Upvotes: 2

Arne Mertz

Reputation: 24616

As far as I can see from your code, you are shuffling for good luck until the mix you get is a valid one. My approach would be something like the following:

Let foo be the vector with the "FOO" items and bar the vector with the "BAR" items, and let F and B be their respective sizes.
Any bar item must be preceded by a "FOO" item, except if the "BAR" item is the first item in the result sequence. So if we draw together the "BAR"s and their preceding "FOO"s, we get B "FOOBAR" sequences (or B-1, if we start with a "BAR") and F-B (or F-(B-1)) "FOO"s in between. The probability that a result sequence starts with a "BAR" item is B/(F+1).
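
One way to see where B/(F+1) comes from, assuming every valid arrangement of "FOO" and "BAR" slots should be equally likely: each "BAR" sits in a distinct one of the F+1 gaps around the F "FOO"s, which gives \binom{F+1}{B} arrangements, and \binom{F}{B-1} of them use the first gap.

```latex
P(\text{starts with BAR})
  = \frac{\binom{F}{B-1}}{\binom{F+1}{B}}
  = \frac{F!}{(B-1)!\,(F-B+1)!} \cdot \frac{B!\,(F+1-B)!}{(F+1)!}
  = \frac{B}{F+1}
```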

I'd first make a pattern vector for the mix of "BAR" and "FOO" items, and after that mix together the real result vector. The pattern would consist of "FOOBAR" and "FOO" items, since a "BAR" item must be preceded by a "FOO" except if the result sequence starts with "BAR"

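C++ sketch (needs <algorithm>, <random>, <string> and <vector>; assumes B <= F+1):

std::vector<std::string> mix(std::vector<std::string> foo, std::vector<std::string> bar)
{
  const std::size_t F = foo.size(), B = bar.size();
  std::mt19937 gen(std::random_device{}());

  //true with probability B/(F+1): draw uniformly from {0, ..., F} and compare with B
  bool startsWithBar = std::uniform_int_distribution<std::size_t>(0, F)(gen) < B;
  //if the sequence starts with a BAR, there are B-1 FOOBAR items
  std::size_t nFOOBAR = B - (startsWithBar ? 1 : 0);
  std::size_t nFOO = F - nFOOBAR;

  std::vector<char> pattern(nFOOBAR, 'm'); //m for mix - FOOBAR = FOO followed by a BAR
  pattern.insert(pattern.end(), nFOO, 'f');

  std::shuffle(pattern.begin(), pattern.end(), gen);
  std::shuffle(foo.begin(), foo.end(), gen);
  std::shuffle(bar.begin(), bar.end(), gen);

  std::vector<std::string> result;
  result.reserve(F+B);

  auto itBar = bar.begin();
  auto itFoo = foo.begin();

  //fill the result according to the pattern
  if (startsWithBar)
    result.push_back(*itBar++);

  for (char patt : pattern)
  {
    switch (patt)
    {
      case 'f': //"FOO"
        result.push_back(*itFoo++);
        break;
      case 'm': //"FOOBAR"
        result.push_back(*itFoo++);
        result.push_back(*itBar++);
        break;
    }
  }
  return result;
}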

For your example {"FOO","FOO","BAR","FOO","BAR","FOO","FOO","BAR"}, the pattern would be {'f', 'm', 'm', 'f', 'm'}.

Upvotes: 0

ipc

Reputation: 8143

The resulting list consists of "BAR","FOO" and "FOO" elements. For example

{"FOO","FOO","BAR","FOO","BAR","FOO","FOO","BAR","FOO"}

can be split up to

"FOO" | "FOO" | "BAR","FOO" | "BAR","FOO" | "FOO" | "BAR","FOO"

which can be compressed to

{0, 0, 1, 1, 0, 1}

where 0 means single element and 1 means a transition from "BAR" to "FOO".

The number of 0s and 1s is invariant, so one can generate a vector containing these and shuffle it.

The only problem is at the end, where a single "BAR" is also valid (the same problem arises at the beginning if you look at "BAR","FOO" as primitive element).

This can be solved if the vector containing "FOO" is increased by 1 dummy element (sentinel). The resulting list always ends with an element of "FOO" but is otherwise truly random. But we can safely remove the last element since this is our dummy.

A simple implementation (without templating on iterators and allocators; it needs <algorithm>, <cassert>, <chrono>, <random>, <string> and <vector>) could look like this:

std::vector<std::string> mergeVectors(std::vector<std::string> const& canvas,
                                      std::vector<std::string> const& sprinkle)
{
  assert (canvas.size() + 1 >= sprinkle.size()); // otherwise impossible

  std::vector<int> transitions; // 1 for [sprinkle, canvas]
                                // 0 for single [canvas]

  // sprinkle.size() times [sprinkle, canvas]
  transitions.insert(transitions.end(), sprinkle.size(), 1);
  // rest is [canvas].
  transitions.insert(transitions.end(), canvas.size() - sprinkle.size() + 1, 0);

  // There is a problem with the last element since this always is from canvas
  // as well.  So we set the last canvas to a sentinel element which is always removed.
  // This way, we can ensure that the result is truly randomly distributed.

  std::mt19937 generator(std::chrono::system_clock::now().time_since_epoch().count());
  std::shuffle(transitions.begin(), transitions.end(), generator);

  bool last_is_sprinkle = transitions.back(); transitions.pop_back();

  std::vector<std::string> result;
  auto canvas_it   = canvas.begin();
  auto sprinkle_it = sprinkle.begin();

  for (auto t : transitions) {
    if (t) result.push_back(*sprinkle_it++);
    result.push_back(*canvas_it++);
  }
  if (last_is_sprinkle)
    result.push_back(*sprinkle_it);
  return result;
}

Upvotes: 6

Svalorzen

Reputation: 5608

An idea would be to do as follows:

  • Shuffle A.
  • Initialise the resulting vector C as shuffled B. At this point the elements in C are |B|, and you know you will have to put at least |B|-1 elements from A between them.
  • Pop elements from the shuffled A vector, inserting them in the appropriate positions within C, so as to avoid repeated B elements within C.
  • Finish by performing |A|-(|B|-1) insertions into the C vector, at random positions, using the remaining elements of A.

Suppose you have:

A = {FOO1, FOO2, FOO3, FOO4, FOO5}
B = {BAR1, BAR2, BAR3}

Step 1 & 2:

A = {FOO4, FOO3, FOO1, FOO5, FOO2}
C = {BAR3, BAR1, BAR2}

Step 3:

A = {FOO1, FOO5, FOO2}
C = {BAR3, FOO4, BAR1, FOO3, BAR2}

Step 4:

C = {FOO1, FOO5, BAR3, FOO4, BAR1, FOO3, FOO2, BAR2}
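
The four steps above could be sketched in C++ roughly as follows. The function name sprinkleMerge, the generator parameter and the exception for the impossible case are placeholders of mine, not part of the original idea.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <stdexcept>
#include <string>
#include <vector>

// Sketch of the four steps; a and b are taken by value so they can be shuffled.
std::vector<std::string> sprinkleMerge(std::vector<std::string> a,
                                       std::vector<std::string> b,
                                       std::mt19937& gen)
{
    if (b.size() > a.size() + 1)
        throw std::invalid_argument("not enough elements in A to separate B");

    std::shuffle(a.begin(), a.end(), gen);  // step 1
    std::shuffle(b.begin(), b.end(), gen);  // step 2

    // Step 3: build C from the shuffled B with one element of A
    // between each pair of neighbouring B elements.
    std::vector<std::string> c;
    c.reserve(a.size() + b.size());
    std::size_t ia = 0;
    for (std::size_t i = 0; i < b.size(); ++i) {
        if (i > 0) c.push_back(a[ia++]);
        c.push_back(b[i]);
    }

    // Step 4: insert the remaining A elements at random positions.
    // Inserting an A element can never make two B elements adjacent.
    for (; ia < a.size(); ++ia) {
        std::uniform_int_distribution<std::size_t> pos(0, c.size());
        c.insert(c.begin() + static_cast<std::ptrdiff_t>(pos(gen)), a[ia]);
    }
    return c;
}
```

Two caveats on this sketch: every insert into the middle of a std::vector shifts the tail, so step 4 is quadratic in the worst case, and working through F=3, B=2 by hand suggests that not every valid arrangement comes out equally likely (runs of A between two B elements are favoured).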

Upvotes: 0

user2218982

Reputation:

If your second vector v2 is shorter than v1, you have to copy each element of v2 into the output vector vout and add one or more elements of v1 each time. One solution is to enumerate all possible indices of v1 and randomly erase them until only as many indices are left as there are elements in v2. The difference between two adjacent indices is then the number of elements of v1 that end up in vout between two consecutive elements of v2. For example:

    using size_type = std::vector<std::string>::size_type;

    std::vector<size_type> vid(v1.size());
    std::iota(begin(vid), end(vid), 0);
    std::shuffle(begin(vid), end(vid), std::mt19937{std::random_device{}()}); // needs <random>; std::random_shuffle was removed in C++17
    vid.erase(std::next(begin(vid), v2.size()), end(vid));
    std::sort(begin(vid), end(vid));

    size_type id_last = 0;
    for(size_type i = 0; i < vid.size(); ++i) {
        vout.insert(end(vout), std::next(begin(v1), id_last),
                                                 std::next(begin(v1), vid[i]));
        vout.push_back(v2[i]);
        id_last = vid[i];
    }
    vout.insert(end(vout), std::next(begin(v1), id_last), end(v1)); // id_last == vid.back() here, and this is also safe when v2 is empty

This is probably not the fastest method, but it should outline the idea behind it. I believe this whole index management can also be rewritten using some iterator adaptors like in boost. Also, if you don't need the original string vectors after merging, you can move the strings instead of copying them.
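
A variation on the same index idea, assuming C++17 is available: std::sample can pick the positions directly and returns them in ascending order when the population is a forward range, so the shuffle, erase and sort steps are not needed. Unlike the code above, this sketch samples from the |v1|+1 gaps, so an element of v2 may also land at the very end. The name mergeByGaps and the generator parameter are placeholders of mine.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <numeric>
#include <random>
#include <stdexcept>
#include <string>
#include <vector>

// Put each element of v2 into a distinct gap of v1 (gap g means "before v1[g]",
// gap v1.size() means "after the last element"). Distinct gaps guarantee that
// two elements of v2 are never adjacent. Both inputs keep their relative order.
std::vector<std::string> mergeByGaps(const std::vector<std::string>& v1,
                                     const std::vector<std::string>& v2,
                                     std::mt19937& gen)
{
    if (v2.size() > v1.size() + 1)
        throw std::invalid_argument("v2 has more elements than v1 has gaps");

    std::vector<std::size_t> gaps(v1.size() + 1);
    std::iota(gaps.begin(), gaps.end(), std::size_t{0});

    // Choose v2.size() gaps without replacement; the output stays sorted
    // because the population iterators are forward iterators.
    std::vector<std::size_t> chosen;
    chosen.reserve(v2.size());
    std::sample(gaps.begin(), gaps.end(), std::back_inserter(chosen),
                v2.size(), gen);

    std::vector<std::string> vout;
    vout.reserve(v1.size() + v2.size());
    std::size_t next = 0;  // index of the next chosen gap / element of v2
    for (std::size_t g = 0; g <= v1.size(); ++g) {
        if (next < chosen.size() && chosen[next] == g)
            vout.push_back(v2[next++]);
        if (g < v1.size())
            vout.push_back(v1[g]);
    }
    return vout;
}
```

Shuffle v1 and v2 beforehand if the order of the strings themselves should be random as well.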

Upvotes: 0
