Random number distribution is uneven / non uniform

Question

I have noticed a strange issue with the random number generation in c#, it looks like sets (patterns) are repeated a lot more often than you would expect.

I'm writing a mechanism that generates activation codes, a series of 7 numbers (range 0-29). Doing the math, there should be 30^7 (22billion) possible combinations of activation codes. Based on this it should be extremely unlikely to get a duplicate activation code before the 1 billionth code is generated. However running my test, I start getting duplicate codes after about 60,000 iteration, which is very surprising. I have also tried using RNGCryptoServiceProvider with similar results, I get duplicates at about 100,000 iterations.

I would really like to know if this is a bug/limitation of the random number generation in .Net or if I'm doing something wrong.

The following code is a test to validate the uniqueness of the generated codes:

        static void Main(string[] args)
    {
        Random rand = new Random();
        RandomActivationCode(rand, true);
        Console.Out.WriteLine("Press enter");
        Console.ReadLine();
    }

    static void RandomActivationCode(Random randomGenerator)
    {
        var maxItems = 11000000;
        var list = new List(maxItems);
        var activationCodes = new HashSet(list);
        activationCodes.Clear();
        DateTime start = DateTime.Now;
        for (int i = 0; i < maxItems; ++i)
        {
            string activationCode = "";
            for (int j = 0; j < 7; ++j)
            {
                activationCode += randomGenerator.Next(0,30) + "-";
            }
            if (activationCodes.Contains(activationCode))
            {
                Console.Out.WriteLine("Code: " + activationCode);
                Console.Out.WriteLine("Duplicate at iteration: " + i.ToString("##,#"));
                Console.Out.WriteLine("Press enter");
                Console.ReadLine();
                Console.Out.WriteLine();
                Console.Out.WriteLine();
            }
            else
            {
                activationCodes.Add(activationCode);
            }
            if (i % 100000 == 0)
            {

                Console.Out.WriteLine("Iteration: " + i.ToString("##,#"));
                Console.Out.WriteLine("Time elapsed: " + (DateTime.Now - start));
            }

        }
    }

My workaround is to use 10 number activation codes, which means that the test runs without any duplicate values being generated. The test runs up to 11 million iterations (after which point it runs out of memory).

Eric Lippert · Accepted Answer

This is not at all surprising; this is exactly what you should expect. Your belief that it should take a long time to generate duplicates when the space of possibilities is large is simply false, so stop believing that. Start believing the truth: that if there are n possible codes then you should start getting duplicates at about the square root of n codes generated, which is about 150 thousand if n is 22 billion.

Think about it this way: by the time you have generated root-n codes, most of them have had roughly a root-n-in-n chance to have a collision. Multiply root-n by roughly root-n-in-n, and you get... roughly 100% chance of collision.

That is of course not a rigorous argument, but it should give you the right intution, to replace your faulty belief. If that argument is unconvincing then you might want to read my article on the subject:

http://blogs.msdn.com/b/ericlippert/archive/2010/03/22/socks-birthdays-and-hash-collisions.aspx

If you want to generate a unique code then generate a GUID; that's what they're for. Note that a GUID is not guaranteed to be random, it is only guaranteed to be unique.

Another choice for generating random seeming codes that are not actually random at all, but are unique, is to generate the numbers 1, 2, 3, 4, ... as many as you want, and then use the multiplicative inverse technique to make a random-looking unique encoding of those numbers. See http://ericlippert.com/2013/11/14/a-practical-use-of-multiplicative-inverses/ for details.

Random number distribution is uneven / non uniform

Answers (1)

Related Questions