Reputation: 2777
Let's say I have a sorted array of values:
int n = 4; // always less than or equal to the number of unique values in the array
int i[256] = {0};
int v[] = {1, 1, 2, 4, 5, 5, 5, 5, 5, 7, 7, 9, 9, 11, 11, 13};
// EX 1 ^ ^ ^ ^
// EX 2 ^ ^ ^ ^
// EX 3 ^ ^ ^ ^
I would like to generate n random index values i[0] ... i[n-1], so that v[i[0]] ... v[i[n-1]] each point to a unique number (i.e. no two indices may point to 5).
What I've tried so far:
I'm implementing this in C, so the more standard C functions I can rely on and the shorter the code, the better. (For example, shuffle is not a standard C function, but if I must, I must.)
Upvotes: 1
Views: 190
Reputation: 241701
This algorithm is called reservoir sampling, and can be used whenever you know how big a sample you need but not how many elements you're sampling from. (The name comes from the idea that you always maintain a reservoir of the correct number of samples. When a new value comes in, you mix it into the reservoir, remove a random element, and continue.)
1. Create an array sample of size n.
2. Start scanning the input array. Each time you find a new value, add its index to sample, until you have n sampled elements.
3. Continue scanning the array, but now when you find a new value:
   a. Choose a random number r in the range [0, i), where i is the number of unique values seen so far.
   b. If r is less than n, overwrite element r with the new element.
4. When you get to the end, sort sample, assuming you need it to be sorted.
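To make sure you always have the last element in the sample, run the above algorithm to select a sample of size n-1, and only consider an element once you have found a bigger one. The largest value is then never offered to the sampler, so its index can be appended as the n-th element.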
The algorithm is linear in the size of v (plus an n log n term for the sort in the last step). If you already have the list of last indices of each value, there are faster algorithms (but then you would know the size of the universe before you started sampling; reservoir sampling is primarily useful if you don't know that).
In fact, it is not conceptually different from collecting all the indices and then finding the prefix of a Fisher-Yates shuffle. But it uses O(n) temporary memory instead of enough to store the entire index list, which may be considered a plus.
Here's an untested sample C implementation (which requires you to write the function randrange()):
#include <stddef.h>

/* Supplied by the caller: returns a uniformly distributed integer
 * in the half-open range [lo, hi). */
size_t randrange(size_t lo, size_t hi);

/* Produces (in `out`) a uniformly distributed sample of maximum size
 * `outlen` of the indices of the last occurrences of each unique
 * element in `in` with the requirement that the last element must
 * be in the sample.
 * Requires: `in` must be sorted.
 * Returns: the size of the generated sample, which will be `outlen`
 *          unless there were not enough unique elements.
 * Note: `out` is not sorted, except that the last element in the
 *       generated sample is the last valid index in `in`
 */
size_t sample(const int* in, size_t inlen, size_t* out, size_t outlen) {
  size_t found = 0;
  if (inlen && outlen) {
    // The last output is fixed so we need outlen-1 random indices
    --outlen;
    int prev = in[0];
    for (size_t curr = 1; curr < inlen; ++curr) {
      if (in[curr] == prev) continue;
      // Add curr - 1 to the output: append while the reservoir is
      // still filling; afterwards overwrite a random slot with
      // probability outlen / (number of unique values seen so far)
      size_t r = found < outlen ? found : randrange(0, found + 1);
      ++found;
      if (r < outlen) out[r] = curr - 1;
      prev = in[curr];
    }
    // Add the last index to the output and count it in the result
    if (found > outlen) found = outlen;
    out[found++] = inlen - 1;
  }
  return found;
}
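Since the answer leaves randrange() to the reader, here is one possible sketch of it. The signature with size_t arguments and the half-open range [lo, hi) are assumptions inferred from how sample() calls it, not something the answer specifies. It builds on the standard rand() and uses rejection sampling so that the modulo step does not favour small values; it assumes the requested span is no larger than RAND_MAX + 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: returns a uniformly distributed value in [lo, hi).
 * Assumes lo < hi and hi - lo <= RAND_MAX + 1. */
size_t randrange(size_t lo, size_t hi) {
    assert(lo < hi);
    unsigned long span = (unsigned long)(hi - lo);
    unsigned long range = (unsigned long)RAND_MAX + 1UL;
    assert(span <= range);
    /* Largest multiple of span that does not exceed range; draws at or
     * above it are rejected so every residue is equally likely. */
    unsigned long limit = range - range % span;
    unsigned long x;
    do {
        x = (unsigned long)rand();
    } while (x >= limit);
    return lo + (size_t)(x % span);
}
```

Seed the generator once with srand() (for example from time(NULL)) before the first call; otherwise every run produces the same sequence of indices.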
Upvotes: 5
Reputation: 34829
Create an array of the last index of each value, leaving out the final number:
int last[] = { 1, 2, 3, 8, 10, 12, 14 };
Fisher-Yates shuffle the array.
Take the first n-1 elements from the shuffled array.
Add the index of the final number (the last index of the input array).
Sort the resulting array, if desired.
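The steps above might look roughly like this in C. This is only a sketch: the function name pick_unique and the helper cmp_size are made up for illustration, the temporary array is allocated with malloc(), and rand() % k is used for brevity even though it is slightly biased. Only the first n-1 positions are shuffled, since the rest of the shuffled array is never read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Comparison callback for qsort() on size_t values. */
static int cmp_size(const void *a, const void *b) {
    size_t x = *(const size_t *)a, y = *(const size_t *)b;
    return (x > y) - (x < y);
}

/* Hypothetical sketch: writes up to n indices into out, sorted ascending.
 * Each index is the last occurrence of a distinct value in the sorted
 * array v, and len - 1 is always included. Returns the number written
 * (0 on empty input or allocation failure). */
size_t pick_unique(const int *v, size_t len, size_t *out, size_t n) {
    assert(v != NULL && out != NULL);
    if (len == 0 || n == 0) return 0;
    size_t *last = malloc(len * sizeof *last);
    if (last == NULL) return 0;

    /* Collect the last index of every value except the final one. */
    size_t m = 0;
    for (size_t i = 0; i + 1 < len; ++i)
        if (v[i] != v[i + 1]) last[m++] = i;

    /* Partial Fisher-Yates shuffle: fix the first `take` positions. */
    size_t take = (n - 1 < m) ? n - 1 : m;
    for (size_t i = 0; i < take; ++i) {
        size_t j = i + (size_t)rand() % (m - i);
        size_t tmp = last[i];
        last[i] = last[j];
        last[j] = tmp;
        out[i] = last[i];
    }
    free(last);

    /* The index of the final number is always part of the result. */
    out[take] = len - 1;
    qsort(out, take + 1, sizeof *out, cmp_size);
    return take + 1;
}
```

For the array in the question, the intermediate array built by the first loop is { 1, 2, 3, 8, 10, 12, 14 }, matching the one shown above.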
Upvotes: 5