Reputation: 3091

An efficient way of making a large random bytearray

I need to create a large bytearray of a specific size, but the size is not known prior to run time. The bytes need to be fairly random. The bytearray size may be as small as a few KB or as large as several MB. I do not want to iterate byte-by-byte; that is too slow. I need performance similar to numpy.random. However, I do not have the numpy module available for this project. Is there something in a standard Python install that will do this? Or do I need to compile my own using C?

For those asking for timings:

>>> timeit.timeit('[random.randint(0,128) for i in xrange(1,100000)]',setup='import random', number=100)
35.73110193696641
>>> timeit.timeit('numpy.random.random_integers(0,128,100000)',setup='import numpy', number=100)
0.5785652013481126
>>>

Upvotes: 40

Answers (4)

Richard Thiessen

Reputation: 242

There are several possibilities, some faster than os.urandom. Also consider whether the data has to be generated deterministically from a random seed. This is invaluable for unit tests where failures have to be reproducible.
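To illustrate the reproducibility point, here is a minimal sketch of seeded generation. It assumes Python 3 (int.to_bytes is not available on Python 2), and the name seeded_random_bytes is made up for this example. It is not a cryptographically secure source.

```python
import random

def seeded_random_bytes(n, seed):
    """Return n pseudorandom bytes determined only by `seed` (illustrative helper)."""
    if n == 0:
        return bytearray()
    # A private generator keeps the global random state untouched.
    rng = random.Random(seed)
    # Draw all 8*n bits in one call, then convert the big integer to bytes.
    return bytearray(rng.getrandbits(8 * n).to_bytes(n, "big"))
```

Calling it twice with the same seed gives identical data, so a failing test can be replayed. On Python 3.9 and later, random.Random(seed).randbytes(n) should serve the same purpose.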

short and pithy:

lambda n:bytearray(map(random.getrandbits,(8,)*n))

I've used the above for unit tests and it was fast enough, but can it be done faster?

using itertools:

lambda n:bytearray(itertools.imap(random.getrandbits,itertools.repeat(8,n)))

itertools and struct producing 8 bytes per iteration

lambda n:(b''.join(map(struct.Struct("!Q").pack,itertools.imap(
    random.getrandbits,itertools.repeat(64,(n+7)//8)))))[:n]

Anything based on b''.join will fill 3-7x the memory consumed by the final bytearray with temporary objects since it queues up all the sub-strings before joining them together and python objects have lots of storage overhead.

Producing large chunks with a specialized function gives better performance and avoids filling memory.

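import random,itertools,struct,operator
def randbytes(n,_struct8k=struct.Struct("!1000Q").pack_into):
    # small requests: pack (n+7)//8 random 64-bit words in one call, then trim to n bytes
    if n<8000:
        longs=(n+7)//8
        return struct.pack("!%iQ"%longs,*map(
            random.getrandbits,itertools.repeat(64,longs)))[:n]
    # large requests: fill a preallocated bytearray in place, 8000 bytes at a time
    data=bytearray(n)
    for offset in xrange(0,n-7999,8000):
        _struct8k(data,offset,
            *map(random.getrandbits,itertools.repeat(64,1000)))
    # the remaining tail is shorter than 8000 bytes, so the recursive call takes the small path
    offset+=8000
    data[offset:]=randbytes(n-offset)
    return data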
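The function above is Python 2 (xrange). For readers on Python 3, the same chunked idea might look roughly like the sketch below. The name randbytes_py3 and the module-level constants are my own, and unlike the original it always returns a bytearray, even for small sizes.

```python
import random
import struct

_CHUNK_WORDS = 1000                 # 64-bit words written per chunk
_CHUNK_BYTES = 8 * _CHUNK_WORDS     # 8000 bytes per chunk
_pack_chunk = struct.Struct("!%dQ" % _CHUNK_WORDS).pack_into

def randbytes_py3(n):
    """Fill a preallocated bytearray with random data, 8000 bytes at a time."""
    data = bytearray(n)
    offset = 0
    # Write whole chunks directly into the buffer, avoiding temporary strings.
    while n - offset >= _CHUNK_BYTES:
        words = (random.getrandbits(64) for _ in range(_CHUNK_WORDS))
        _pack_chunk(data, offset, *words)
        offset += _CHUNK_BYTES
    # Tail: round up to whole 64-bit words, pack once, then trim to size.
    rest = n - offset
    count = (rest + 7) // 8
    tail = struct.pack("!%dQ" % count,
                       *(random.getrandbits(64) for _ in range(count)))
    data[offset:] = tail[:rest]
    return data
```

I have not benchmarked this port, so the MB/s figures below apply to the Python 2 original only.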

Performance

0.84 MB/s: original solution with randint
4.8 MB/s: bytearray(getrandbits(8) for _ in xrange(n)) (solution by other poster)
6.4 MB/s: bytearray(map(getrandbits,(8,)*n))
7.2 MB/s: itertools and getrandbits
10 MB/s: os.urandom
23 MB/s: itertools and struct
35 MB/s: optimised function (holds for len = 100MB ... 1KB)

Note: all tests used 10KB as the string size. Results were consistent up to the point where intermediate results filled memory.

Note: os.urandom is meant to provide secure random seeds. Applications expand that seed with their own fast PRNG. Here's an example, using AES in counter mode as a PRNG:

import os
seed=os.urandom(32)

from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
from cryptography.hazmat.backends import default_backend
backend = default_backend()
cipher = Cipher(algorithms.AES(seed), modes.CTR(b'\0'*16), backend=backend)
encryptor = cipher.encryptor()

nulls=b'\0'*(10**5) #100k
from timeit import timeit
t=timeit(lambda:encryptor.update(nulls),number=10**4) #1GB, (100k*10k)
print("%.1f MB/s"%(1000/t))

This produces pseudorandom data at 180 MB/s (no hardware AES acceleration, single core). That's only ~5x the speed of the pure Python code above.

Addendum

There's a pure python crypto library waiting to be written. Putting the above techniques together with hashlib and stream cipher techniques looks promising. Here's a teaser, a fast string xor (42MB/s).

def xor(a,b):
    s="!%iQ%iB"%divmod(len(a),8)
    return struct.pack(s,*itertools.imap(operator.xor,
        struct.unpack(s,a),
        struct.unpack(s,b)))
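The xor above relies on itertools.imap, which exists only on Python 2. A rough Python 3 equivalent, using the built-in lazy map, could look like this. The name xor_bytes and the length check are additions of mine, and I have not measured its speed.

```python
import operator
import struct

def xor_bytes(a, b):
    """XOR two equal-length byte strings, eight bytes at a time."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    # "!<q>Q<r>B": q 64-bit words followed by r leftover single bytes.
    fmt = "!%dQ%dB" % divmod(len(a), 8)
    return struct.pack(fmt, *map(operator.xor,
                                 struct.unpack(fmt, a),
                                 struct.unpack(fmt, b)))
```

XOR-ing a result with one of its inputs gives back the other input, which makes for an easy sanity check.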

Upvotes: 13

Ned Batchelder

Reputation: 376052

The os module provides urandom, even on Windows:

bytearray(os.urandom(1000000))

This seems to perform as quickly as you need. In fact, I get better timings than your numpy run (though our machines could be wildly different):

timeit.timeit(lambda:bytearray(os.urandom(1000000)), number=10)
0.0554857286941

Upvotes: 59

dr jimbob

Reputation: 17771

What's wrong with just including numpy? Anyhow, this creates a random N-bit integer:

import random
N = 100000
bits = random.getrandbits(N)

So if you needed to see if the value of the j-th bit is set or not, you can do bits & (2**j)==(2**j)
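As a small sketch of that bit test, the same check can be written with a shift instead of building 2**j. The helper name bit_is_set is only for illustration.

```python
import random

def bit_is_set(bits, j):
    """Return True if bit j (0 = least significant) of the integer `bits` is 1."""
    return bool((bits >> j) & 1)

N = 100000
bits = random.getrandbits(N)      # one random N-bit integer
first_byte = bits & 0xFF          # the lowest eight bits as a value in 0..255
```

Both forms give the same answer. The shift version just avoids creating a second large integer when j is big.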

EDIT: He asked for a byte array, not a bit array. Ned's answer is better: your_byte_array = bytearray(random.getrandbits(8) for i in xrange(N))

Upvotes: 7

Ned Batchelder

Reputation: 376052

import random
def randbytes(n):
    for _ in xrange(n):
        yield random.getrandbits(8)

my_random_bytes = bytearray(randbytes(1000000))

There's probably something in itertools that could help here, there always is...

My timings indicate that this goes about five times faster than [random.randint(0,128) for i in xrange(1,100000)]

Upvotes: 4
