markt1964

Reputation: 2836

How to generate a list of ascending random integers

I have an external collection containing n elements, from which I want to select some number (k) of elements at random, outputting the indices of those elements to some serialized data file. I want the indices to be output in strictly ascending order, with no duplicates. Both n and k may be quite large, and it is generally not feasible to simply store entire arrays of that size in memory.

The first algorithm I came up with was to pick a random number r[0] from 1 to n-k, then pick each successive random number r[i] from r[i-1]+1 to n-k+i, so that only two entries of 'r' need to be stored at any one time. However, a fairly simple analysis reveals that the probability of selecting small numbers is inconsistent with what it would be if the entire set were sampled uniformly. For example, if n were a billion and k half a billion, the probability of selecting the first entry with the approach I've just described is tiny (1 in half a billion), whereas in actuality, since half of the entries are being selected, the first entry should be selected 50% of the time. Even if I used external sorting to sort k random numbers, I would have to discard any duplicates and try again. As k approaches n, the number of retries keeps growing, with no guarantee of termination.

I would like to find an O(k) or O(k log k) algorithm to do this, if that is at all possible. The implementation language I will be using is C++11, but descriptions in pseudocode would still be helpful.

Upvotes: 7

Views: 2494

Answers (7)

samgak

Reputation: 24417

You can solve this recursively in O(k log k) if you partition in the middle of your range, and randomly sample from the hypergeometric probability distribution to choose how many values lie above and below the middle point (i.e. the values of k for each subsequence), then recurse for each:

#include <iostream>
#include <random>

// shared engine, seeded once, so that repeated calls produce different samples
std::mt19937 generator{std::random_device{}()};

int sample_hypergeometric(int n, int K, int N) // samples hypergeometric distribution and
// returns number of "successes" where there are n draws without replacement from
// a population of N with K possible successes.
// Something similar to scipy.stats.hypergeom.rvs in Python.
// In this case, "success" means the selected value lying below the midpoint.
{
    std::uniform_real_distribution<double> distribution(0.0, 1.0);

    int successes = 0;
    for(int trial = 0; trial < n; trial++)
    {
        if((int)(distribution(generator) * N) < K)
        {
            successes++;
            K--;
        }
        N--;
    }
    return successes;
}

void select_k_from_n(int start, int k, int n)
{
    if(k == 0)
        return;
    if(k == 1)
    {
        // output a single value chosen uniformly from (start, start + n]
        std::uniform_int_distribution<int> pick(1, n);
        std::cout << start + pick(generator) << '\n';
        return;
    }

    // find the number of results below the mid-point:
    int k1 = sample_hypergeometric(k, n >> 1, n);
    select_k_from_n(start, k1, n >> 1);
    select_k_from_n(start + (n >> 1), k - k1, n - (n >> 1));
}

Sampling from the binomial distribution could also be used to approximate the hypergeometric distribution with p = (n >> 1) / n, rejecting samples where k1 > (n >> 1).
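
For instance, a minimal sketch of that approximation, reusing the shared generator above (the helper name is mine; note that a draw would also have to be rejected when it implies more than N - K values above the midpoint, a case not mentioned above):

// Approximate the hypergeometric sample with a binomial draw, rejecting
// counts that are not feasible for the two halves of the range.
int sample_hypergeometric_binomial(int n, int K, int N)
{
    std::binomial_distribution<int> binom(n, (double)K / N);
    int k1;
    do {
        k1 = binom(generator);
    } while (k1 > K || n - k1 > N - K);  // more "successes" than K, or
                                         // more "failures" than N - K
    return k1;
}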

Upvotes: 2

sjrowlinson

Reputation: 3355

As mentioned in my comment, use a std::set<int> to store the randomly generated integers such that the resulting container is inherently sorted and contains no duplicates. Example code snippet:

#include <random>
#include <set>

int main(void) {
    std::set<int> random_set;
    std::random_device rd;
    std::mt19937 mt_eng(rd());
    // min and max of random set range
    const int m = 0; // min
    const int n = 100; // max
    std::uniform_int_distribution<> dist(m,n);

    // number to generate
    const int k = 50;
    for (int i = 0; i < k; ++i) {
        // only non-previously occurring values will be inserted
        if (!random_set.insert(dist(mt_eng)).second)
            --i;
    }
}

Upvotes: 2

Vlad Shcherbina

Reputation: 179

If in practice k has the same order of magnitude as n, perhaps a very straightforward O(n) algorithm will suffice:

assert(k <= n);
std::uniform_real_distribution<double> rnd(0.0, 1.0);
for (int i = 0; i < n; i++) {
    if (rnd(engine) * (n - i) < k) {
        std::cout << i << std::endl;
        k--;
    }
}

It produces all ascending sequences with equal probability: at each step, index i is selected with probability (remaining k) / (n - i), which makes every k-element subset equally likely.
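
For completeness, a self-contained version of the snippet above (the engine choice and the example values of n and k are just illustrative):

#include <cassert>
#include <iostream>
#include <random>

int main() {
    long long n = 1000000;   // population size (illustrative values)
    long long k = 500000;    // how many ascending indices to emit
    assert(k <= n);

    std::mt19937 engine{std::random_device{}()};
    std::uniform_real_distribution<double> rnd(0.0, 1.0);

    for (long long i = 0; i < n; i++) {
        if (rnd(engine) * (n - i) < k) {
            std::cout << i << '\n';   // indices come out already in ascending order
            k--;
        }
    }
}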

Upvotes: 4

Jim Mischel

Reputation: 134015

Assuming that you can't store k random numbers in memory, you'll have to generate the numbers in strictly ascending order. One way to do it would be to generate a number between 0 and n/k. Call that number x. The next number you have to generate is between x+1 and roughly x + (n-x)/(k-1). Continue in that fashion until you've selected k numbers.

Basically, you're dividing the remaining range by the number of values left to generate, and then generating a number in the first section of that range.

An example. You want to generate 3 numbers between 0 and 99, inclusive. So you first generate a number between 0 and 33. Say you pick 10.

So now you need a number between 11 and 99. The remaining range consists of 89 values, and you have two values left to pick. So, 89/2 = 44. You need a number between 11 and 54. Say you pick 36.

Your remaining range is from 37 to 99, and you have one number left to choose. So pick a number at random between 37 and 99.

This won't give you a uniform distribution over all possible selections, since each pick is confined to the first partition of whatever range remains. But it might be good enough for your purposes.

This pseudocode shows the basic idea.

pick_k_from_n(n, k)
{
    num_left = k
    last_k = 0;
    while num_left > 0
    {
        // divide the remaining range into num_left partitions
        range_size = (n - last_k) / num_left
        // pick a number in the first partition
        r = random(range_size) + last_k + 1
        output(r)
        last_k = r
        num_left = num_left - 1
    }
}

Note that this takes O(k) time and requires O(1) extra space.
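
A minimal C++ sketch of that pseudocode, assuming a seeded engine; the function name pick_k_from_n and the 0-based index convention are just for illustration:

#include <iostream>
#include <random>

// Emits k indices in ascending order from [0, n), one partition at a time.
void pick_k_from_n(long long n, long long k, std::mt19937_64 &engine)
{
    long long last = -1;   // last index emitted (none yet)
    for (long long num_left = k; num_left > 0; num_left--)
    {
        // divide what remains after `last` into num_left partitions
        long long range_size = (n - 1 - last) / num_left;
        // pick a number in the first partition
        std::uniform_int_distribution<long long> pick(1, range_size);
        long long r = last + pick(engine);
        std::cout << r << '\n';
        last = r;
    }
}

int main()
{
    std::mt19937_64 engine{std::random_device{}()};
    pick_k_from_n(100, 3, engine);   // e.g. 3 ascending indices from 0..99
}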

Upvotes: 1

David Eisenstat

Reputation: 65468

Here's an O(k log k + √n)-time algorithm that uses O(√n) words of space. This can be generalized to an O(k + n^(1/c))-time, O(n^(1/c))-space algorithm for any integer constant c.

For intuition, imagine a simple algorithm that uses (e.g.) Floyd's sampling algorithm to generate k of n elements and then radix sorts them in base √n. Instead of remembering what the actual samples are, we'll do a first pass where we run a variant of Floyd's where we remember only the number of samples in each bucket. The second pass is, for each bucket in order, to randomly resample the appropriate number of elements from the bucket range. There's a short proof involving conditional probability that this gives a uniform distribution.

# untested Python code for illustration
# b is the number of buckets (e.g., b ~ sqrt(n))
import random
def first_pass(n, k, b):
    counts = [0] * b  # list of b zeros
    for j in range(n - k, n):
        t = random.randrange(j + 1)
        if t // b >= counts[t % b]:  # intuitively, "t is not in the set"
            counts[t % b] += 1
        else:
            counts[j % b] += 1
    return counts

Upvotes: 0

Stefan Haustein

Reputation: 18803

Could you adjust each ascending index selection in a way that compensates for the probability distortion you are describing?

IANAS, but my guess would be that if you pick a random number r between 0 and 1 (which you then scale to the full remaining index range after the adjustment), you might be able to adjust it by computing r^x, keeping the value in 0..1 but increasing the probability of smaller numbers, with x chosen by solving the equation for the desired probability of the first entry?

Upvotes: 0

Lee Daniel Crocker

Reputation: 13181

You can do it in O(k) time with Floyd's algorithm (not Floyd–Warshall, that's a shortest-path thing). The only data structure you need is a table (even 1 bit per entry, or a hash set) that tells you whether or not a number has already been selected. Hash-table lookups can be O(1), so this will not be a burden, and the table can be kept in memory even for very large n (if n is truly huge, you'll have to use a B-tree or Bloom filter or something).

To select k items from among n:

for j = n-k+1 to n:
  select random x from 1 to j
  if x is already in hash:
    insert j into hash
  else
    insert x into hash

That's it. At the end, your hash table will contain a uniformly selected sample of k items from among n. Read them out in order (you may have to pick a type of hash table that allows that).
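
A minimal C++ sketch of Floyd's algorithm, using std::set so the ordered read-out comes for free (the function name select_k_floyd and the example ranges are just for illustration):

#include <iostream>
#include <random>
#include <set>

// Floyd's sampling: k distinct values uniformly chosen from 1..n,
// kept in a std::set so they can be read out in ascending order.
std::set<long long> select_k_floyd(long long n, long long k, std::mt19937_64 &engine)
{
    std::set<long long> chosen;
    for (long long j = n - k + 1; j <= n; j++)
    {
        std::uniform_int_distribution<long long> pick(1, j);
        long long x = pick(engine);
        if (!chosen.insert(x).second)   // x was already selected
            chosen.insert(j);           // so take j instead (j cannot be in the set yet)
    }
    return chosen;
}

int main()
{
    std::mt19937_64 engine{std::random_device{}()};
    for (long long index : select_k_floyd(100, 10, engine))
        std::cout << index << '\n';     // ascending, no duplicates
}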

Upvotes: 0
