Martin v. Löwis
Martin v. Löwis

Reputation: 127587

random.choice not random

I'm using Python 2.5 on Linux, in multiple parallel FCGI processes. I use

    chars = string.ascii_letters + string.digits
    cookie = ''.join([random.choice(chars) for x in range(32)])

to generate distinct cookies. Assuming that the RNG is seeded from /dev/urandom, and that the sequence of random numbers comes from the Mersenne twister, I would expect that there is practically zero chance of collision.

However, I do see regular collisions, even though only a few (<100) users are logged in at any time.

Why are the random numbers not more random?

Upvotes: 16

Views: 11090

Answers (5)

bobince
bobince

Reputation: 536775

  1. I don't know how your FCGI processes are being spawned, but is it possible that it's using fork() after the Python interpreter has started (and the random module has been imported by something), hence effectively seeding two processes' random._insts from the same source?

  2. Maybe put some debugging in to check that it is correctly seeding from urandom, and not falling back to the less rigorous time-based seed?

eta re comment: man! That's me stumped then; if the RNG always has different state at startup I can't see how you could possibly get collisions. Weird. Would have to put in a lot of state logging to investigate the particular cases which result in collisions, I guess, which sounds like a lot of work trawling through logs. Could it be (1a) the FCGI server usually doesn't fork, but occasionally does (maybe under load, or something)?

Or (3) some higher-level problem such as a broken HTTP proxy passing the same Set-Cookie to multiple clients?

Upvotes: 1

ilya n.
ilya n.

Reputation: 18826

I had to erase my original answer, which suggested that generator is not seeded from /dev/urandom, since its source (for Python 3.x) clearly says that it is:

def seed(self, a=None):
    """Initialize internal state from hashable object.

    None or no argument seeds from current time or from an operating
    system specific randomness source if available.

    If a is not None or an int or long, hash(a) is used instead.
    """

    if a is None:
        try:
            a = int(_hexlify(_urandom(16)), 16)
        except NotImplementedError:
            import time
            a = int(time.time() * 256) # use fractional seconds

    super().seed(a)
    self.gauss_next = None

I therefore humbly accept that there are mysteries in the world that I may not be able to decipher.

Upvotes: 0

orip
orip

Reputation: 75567

This is definitely not a normal collision scenario:

  • 32 characters with 62 options per character is equivalent to 190 bits (log2(62) * 32)
  • According to the birthday paradox, you should be receiving a collision naturally once every 2**95 cookies, which means never

Could this be a concurrency issue?

  • If so, use different random.Random instances for each thread
  • Can save these instances in thread-local storage (threading.local())
  • On linux, Python should seed them using os.urandom() - not system time - so you should get different streams for each thread.

Upvotes: 4

Glenn Maynard
Glenn Maynard

Reputation: 57554

It shouldn't be generating duplicates.

import random
chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
def gen():
    return ''.join([random.choice(chars) for x in range(32)])

test = [gen() for i in range(100000)]
print len(test), len(set(test)) # 100000 100000

The chances of duplicates is significant with chars = "ab"; 126 duplicates in 1000000 iterations. It's nonexistant with 62.

That said, this isn't a good way to generate cookies, because session cookies need to be unpredictable, to avoid attacks involving stealing other people's session cookies. The Mersenne Twister is not designed for generating secure random numbers. This is what I do:

import os, hashlib
def gen():
    return hashlib.sha1(os.urandom(512)).hexdigest()

test = [gen() for i in range(100000)]
print len(test), len(set(test))

... which should be very secure (which is to say, difficult to take a string of session cookies and guess other existing session cookies from them).

Upvotes: 13

pvoosten
pvoosten

Reputation: 3287

To avoid the problem, you can use a sequence of cookies, that are guaranteed to be different (you can e.g. use a set). Each time you give a cookie to someone, you take it from the sequence and you add another to it. Another option is to generate a UUID and use that as a cookie.

Another way to avoid the problem could be to hold a private key, and use a (e.g. MD5) checksum of the private key, with a counter value joined to it. The probability for collisions will then be very low. To be safer, add a few more variables to the checksum, like the current time, the ip address of the user, ...

Libraries to generate cookies exist. Any WSGI implementation probably contains a cookie generator.

If you're only interested in how random your strings are, you could generate a file with, say, one million cookies and perform randomness checks on that file. This, however, is not what I would recommend.

Upvotes: -4

Related Questions