forthrin

Reputation: 2777

Picking random indexes into a sorted array

Let's say I have a sorted array of values:

int n = 4; // always less than or equal to the number of unique values in v
int i[256] = {0};
int v[] = {1, 1, 2, 4, 5, 5, 5, 5, 5, 7, 7, 9, 9, 11, 11, 13};
// EX 1             ^              ^           ^          ^
// EX 2       ^                          ^            ^   ^
// EX 3       ^  ^                 ^                      ^

I would like to generate n random index values i[0] ... i[n-1], so that:

  1. v[i[0]] ... v[i[n-1]] are all distinct numbers (i.e. no two indexes may point to a 5)
  2. Each index must point to the rightmost occurrence of its number (i.e. it must point to the last 5)
  3. The index of the final number (13 in this case) must always be included.

What I've tried so far:

  1. Getting the indexes to the last of the unique values
  2. Shuffling the indexes
  3. Picking out the first n indexes

I'm implementing this in C, so the more standard C functions I can rely on and the shorter code, the better. (For example, shuffle is not a standard C function, but if I must, I must.)

Upvotes: 1

Views: 190

Answers (2)

rici

Reputation: 241701

This algorithm is called reservoir sampling, and can be used whenever you know how big a sample you need but not how many elements you're sampling from. (The name comes from the idea that you always maintain a reservoir of the correct number of samples. When a new value comes in, you mix it into the reservoir, remove a random element, and continue.)

  1. Create the return value array sample of size n.
  2. Start scanning the input array. Each time you find a new value, add its index to the end of sample, until you have n sampled elements.
  3. Continue scanning the array, but now when you find a new value:

    a. Choose a random number r in the range [0, i), where i is the number of unique values seen so far (including the new one).

    b. If r is less than n, overwrite element r with the new element.

  4. When you get to the end, sort sample, assuming you need it to be sorted.

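To make sure you always have the last element in the sample, run the above algorithm to select a sample of size n-1, and only consider an index as a candidate once you have found a bigger value after it. The final value never has a bigger value after it, so its index is never a candidate and can simply be appended at the end.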

The algorithm is linear in the size of v (plus an n log n term for the sort in the last step). If you already have the list of last indices of each value, there are faster algorithms, but then you would know the size of the universe before you started sampling; reservoir sampling is primarily useful if you don't know that.

In fact, it is not conceptually different from collecting all the indices and then finding the prefix of a Fisher-Yates shuffle. But it uses O(n) temporary memory instead of enough to store the entire index list, which may be considered a plus.

Here's an untested sample C implementation (which requires you to write the function randrange()):

/* Produces (in `out`) a uniformly distributed sample of maximum size
 * `outlen` of the indices of the last occurrences of each unique
 * element in `in` with the requirement that the last element must
 * be in the sample.
 * Requires: `in` must be sorted.
 * Returns: the size of the generated sample, which will be `outlen`
 *          unless there were not enough unique elements.
 * Note: `out` is not sorted, except that the last element in the
 *       generated sample is the last valid index in `in`
 */
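size_t sample(int* in, size_t inlen, size_t* out, size_t outlen) {
  size_t found = 0;  // number of candidate indices seen so far
  if (inlen && outlen) {
    // The last output is fixed so we need outlen-1 random indices
    --outlen;
    int prev = in[0];
    for (size_t curr = 1; curr < inlen; ++curr) {
      if (in[curr] == prev) continue;
      // in[curr - 1] was the last occurrence of prev: offer curr - 1
      if (found < outlen) {
        // Reservoir not yet full: just append
        out[found++] = curr - 1;
      } else {
        // Reservoir full: keep the new index with probability outlen/found.
        // randrange(0, k) must return a uniform value in [0, k).
        size_t r = randrange(0, ++found);
        if (r < outlen) out[r] = curr - 1;
      }
      prev = in[curr];
    }
    // Add the last index to the output and count it in the sample size
    if (found > outlen) found = outlen;
    out[found++] = inlen - 1;
  }
  return found;
}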
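
The function above leaves randrange() to the reader. One possible version is sketched below, assuming it should return a uniformly distributed value in the half-open range [lo, hi) (which is how step 3a uses it) and that rand() is a good enough source of randomness for your purposes. The signature is my assumption, not something fixed by the code above. It uses rejection sampling to avoid the small bias of a plain rand() % span.

```c
#include <stdlib.h>

/* Hypothetical helper: returns a uniformly distributed value in [lo, hi).
 * Assumes lo < hi and hi - lo <= RAND_MAX + 1. Call srand() once beforehand. */
size_t randrange(size_t lo, size_t hi) {
  unsigned long span = (unsigned long)(hi - lo);
  unsigned long range = (unsigned long)RAND_MAX + 1UL;
  /* Largest multiple of span that fits in rand()'s output range */
  unsigned long limit = range - range % span;
  unsigned long x;
  do {
    x = (unsigned long)rand();
  } while (x >= limit);  /* reject values that would skew the result */
  return lo + (size_t)(x % span);
}
```

If the number of unique values can exceed RAND_MAX + 1 (which may be as small as 32768), you would need a wider random source than rand().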

Upvotes: 5

user3386109

Reputation: 34829

Create an array of the last index of each unique value, leaving out the final number (its index is added in a later step)

int last[] = { 1, 2, 3, 8, 10, 12, 14 };

Fisher-Yates shuffle the array.

Take the first n-1 elements from the shuffled array.

Add the index to the final number.

Sort the resulting array, if desired.
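
The steps above might look roughly like this in C. This is only a sketch: the function name pick_last_indices and its signature are made up for illustration, it uses rand() % k for brevity (which is slightly biased), and it stops the Fisher-Yates shuffle after n-1 swaps because only that prefix is used. The final sort is left out; qsort() would do if you need it.

```c
#include <stdlib.h>

/* Hypothetical helper: writes up to n indices into out and returns how
 * many were written. v must be sorted; out must have room for n entries.
 * Returns 0 on empty input, n == 0, or allocation failure. */
size_t pick_last_indices(const int *v, size_t len, size_t *out, size_t n) {
  if (len == 0 || n == 0) return 0;

  /* Step 1: last index of each unique value except the final one */
  size_t *last = malloc(len * sizeof *last);
  if (!last) return 0;
  size_t count = 0;
  for (size_t k = 0; k + 1 < len; ++k)
    if (v[k] != v[k + 1]) last[count++] = k;

  /* Steps 2-3: partial Fisher-Yates, only the first n-1 slots are needed */
  size_t take = n - 1 < count ? n - 1 : count;
  for (size_t k = 0; k < take; ++k) {
    size_t j = k + (size_t)rand() % (count - k);
    size_t tmp = last[k];
    last[k] = last[j];
    last[j] = tmp;
    out[k] = last[k];
  }
  free(last);

  /* Step 4: always include the index of the final number */
  out[take] = len - 1;
  return take + 1;
}
```

With the v from the question and n = 4, this returns 4 and out[3] is always 15 (the index of 13); the other three entries are drawn from { 1, 2, 3, 8, 10, 12, 14 }.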

Upvotes: 5
