devbin

Reputation: 61

Maximize the number of unique random numbers in a range

I am trying to generate uniformly distributed random numbers using the mt19937 engine, with std::random_device as the seed source. Even when I get lucky, I only get a couple hundred thousand unique numbers out of the roughly 4 billion possible values before the first duplicate appears. I was wondering if it could get any better than this.

I attempted to increase the entropy by combining a high-resolution timer with the random device through std::seed_seq (https://stackoverflow.com/a/34493057/5852409), and I also tried initializing all 624 state words of the mt19937 (https://codereview.stackexchange.com/a/109266). However, I did not see any improvement.
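For reference, the seeding attempt looked roughly like this (a sketch reconstructed from the linked answers, not my exact code):

#include <chrono>
#include <random>

// Sketch: mix a high-resolution timer with std::random_device
// through std::seed_seq, as suggested in the linked answers.
std::mt19937 make_seeded_engine()
{
    std::random_device rd;
    auto ticks = static_cast<unsigned int>(
        std::chrono::high_resolution_clock::now().time_since_epoch().count());
    std::seed_seq seq{ rd(), rd(), rd(), rd(), ticks };
    return std::mt19937(seq);
}

The test program below counts how many unique values appear before the first duplicate: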

#include <random>
#include <iostream>
#include <set>

int main()
{
    std::random_device rd;
    std::mt19937 engn(rd());
    // Draw from [0, 0xFFFFFFFE], i.e. just under 2^32 possible values
    std::uniform_int_distribution<unsigned int> unidist(0, 0xFFFFFFFF - 1);

    std::set<unsigned int> s;
    // set::insert returns a pair; .second is false once a duplicate is drawn
    auto itr = s.insert(unidist(engn));
    int k = 0; // ends up equal to the number of unique values drawn
    while (itr.second)
    {
        itr = s.insert(unidist(engn));
        k++;
    }
    std::cout << k << '\n';
}

Upvotes: 0

Views: 110

Answers (1)

Arne Vogel

Reputation: 6706

First and foremost, make sure you understand the birthday paradox: the fact that you get a duplicate after some tens or hundreds of thousands of numbers does not indicate a statistical deficiency in the mt19937.

As a rough estimate due to this paradox, duplicates become likely after about the square root of the number of possible values, even for a perfect random generator – in this case after about 2^16 = 65536 draws.
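To put a number on it, the standard birthday-paradox approximation for n draws from N equally likely values is P(duplicate) ≈ 1 - exp(-n^2 / (2N)). A minimal sketch evaluating it for N just under 2^32:

#include <cmath>
#include <cstdio>

int main()
{
    // Birthday-paradox approximation: P(duplicate) ~= 1 - exp(-n^2 / (2N))
    const double N = 4294967295.0; // just under 2^32 possible values
    for (double n : { 65536.0, 100000.0, 200000.0 })
        std::printf("n = %6.0f -> P(duplicate) ~= %.3f\n",
                    n, 1.0 - std::exp(-n * n / (2.0 * N)));
}

Already at n = 200000 the duplicate probability is about 0.99, which matches the behavior you are seeing.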

Second, note that entropy does not mean uniqueness of outputs. Imagine throwing a perfectly fair 100-sided die 100 times. The probability that at least one value appears twice is much greater than the probability that each value is seen exactly once. Entropy is a measure of the number of states in a system. In your case it relates to the quality of your seed (covering many different initial states of the PRNG), not to the uniqueness of the outputs.
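A quick simulation makes this vivid (a sketch; the fraction will vary slightly per run):

#include <array>
#include <iostream>
#include <random>

int main()
{
    std::mt19937 engine(std::random_device{}());
    std::uniform_int_distribution<int> die(1, 100); // fair 100-sided die

    const int trials = 10000;
    int trials_with_duplicate = 0;
    for (int t = 0; t < trials; ++t)
    {
        std::array<int, 101> seen{}; // per-value counters, zero-initialized
        bool duplicate = false;
        for (int roll = 0; roll < 100; ++roll)
            if (++seen[die(engine)] == 2)
                duplicate = true;
        trials_with_duplicate += duplicate;
    }
    std::cout << "fraction of trials with a duplicate: "
              << double(trials_with_duplicate) / trials << '\n';
}

Virtually every trial contains a repeat – the probability of 100 all-distinct rolls is 100!/100^100, which is astronomically small.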

Third, if you have a use case where you must ensure uniqueness (of IDs or handles, for example) but also need unpredictability, a.k.a. randomness, you basically have two options:

  • Store "taken" values and "re-roll" as long as necessary. There are also probabilistic algorithms for this that can detect duplicates with much less RAM, at the cost of a small probability of false positives.
  • Use a much larger – more than twice as many bits – handle space and hope that no collision will occur. This is appropriate when occasional collisions are unwanted but do limited harm, such as leading to an expensive theoretically unnecessary recalculation.
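A minimal sketch of the re-roll approach, using std::unordered_set to remember taken values (UniqueHandleSource is an illustrative name, not a standard facility):

#include <random>
#include <unordered_set>

// Sketch: draw random handles, re-rolling until an unused one comes up.
class UniqueHandleSource
{
    std::mt19937 engine_{ std::random_device{}() };
    std::uniform_int_distribution<unsigned int> dist_; // full unsigned range
    std::unordered_set<unsigned int> taken_;
public:
    unsigned int next()
    {
        unsigned int h;
        do {
            h = dist_(engine_);             // re-roll...
        } while (!taken_.insert(h).second); // ...until the insert succeeds
        return h;
    }
};

For the second option, simply switching to std::mt19937_64 with a 64-bit handle type already makes collisions extremely unlikely for realistic numbers of handles.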

Upvotes: 2
