Alcott
Alcott

Reputation: 18585

how to generate 1 million random integers and write them to a file?

I was trying to do some tests on my external sorting algorithms, and I thought I should generate a huge amount of random numbers and put them into a file.

Here is how I do it:

import tempfile, random

nf = tempfile.NamedTemporaryFile(delete=False)
i = 0
while i < 1000:
    j = 0
    buf = ''
    while j < 1000:
        buf += str(random.randint(0, 1000))
        j += 1
    nf.write(buf)
    i += 1

I thought, I should speed up the generating process by reducing the File IO operations, so I use buf to store as many numbers as possible, then write buf to the file.

Question:

I still got a sense that, the generating and writing process was slow.

Am I getting something wrong?

EDIT:

In C++, we can simply write an int or a float into file by << without converting them into string.

So can we do the same in Python? I mean write an integer into file without converting it into str.

Upvotes: 4

Views: 5319

Answers (6)

Eric O. Lebigot
Eric O. Lebigot

Reputation: 94485

Operating systems are already optimized for such I/O operations. So, you can directly write the numbers to file and get a very good speed:

import tempfile, random

with tempfile.NamedTemporaryFile(delete=False) as nf:
    for _ in xrange(1000000):  # xrange() is more efficient than range(), in Python 2
        nf.write(str(random.randint(0, 1000)))

In practice, the numbers will only be written to the disk when the size-optimized file buffer is full. The code in the question and the code above take the same time, on my machine. So, I would advise to use my simpler code and to rely on the operating system's built-in optimizations.

If the result fits in memory (which is the case for 1 million numbers), then you can indeed save some I/O operations by creating the final string and then writing it in one go:

with tempfile.NamedTemporaryFile(delete=False) as nf:
    nf.write(''.join(str(random.randint(0, 1000)) for _ in xrange(1000000)))

This second approach is 30% faster, on my computer (2.6 s instead of 3.8 s), probably thanks to the single write call (instead of a million write() calls–and probably many fewer actual disk writes).

The "many big writes" approach of your question falls in the middle (3.1 s). It can be improved, though: it is clearer and more Pythonic to write it like this:

import tempfile, random

with tempfile.NamedTemporaryFile(delete=False) as nf:
    for _ in xrange(1000):
        nf.write(''.join(str(random.randint(0, 1000)) for _ in xrange(1000)))

This solution is equivalent to, but faster than the code in the original question (2.6 s on my machine, instead of 3.8 s).

In summary, the first, simple approach above might be fast enough for you. If it is not and if the whole file can fit in memory, the second approach is both very fast and simple. Otherwise, your initial idea (fewer writes, bigger blocks) is good, as it is about as fast as the "single write" approach, and still quite simple, when written as above.

Upvotes: 7

S.Lott
S.Lott

Reputation: 391846

Like this

import random
import struct

with open('binary.dat','wb') as output:
     for i in xrange(1000000):
         u = random.randint(0,999999) # number
         b = struct.pack('i', u) # bytes
         output.write(b)

This will create 4 million bytes of data. 1 million 4-byte values.

You can read up on struct and the various packing options here: http://docs.python.org/library/struct.html.

Upvotes: 1

lostyzd
lostyzd

Reputation: 4523

If you only need to generate some random numbers and you are under linux, try shell command

for i in {1..1000000}; do echo $[($RANDOM % 1000)]; done > test.in

ok, i test this code below, it takes about 5 seconds to finish

import tempfile, random

nf = tempfile.NamedTemporaryFile(delete=False)
for i in xrange(0, 1000000):
    nf.write(str(random.randint(0, 1000)))

Upvotes: 2

Mark Byers
Mark Byers

Reputation: 838156

Don't use string concatenation in a loop. Use str.join instead.

CPython implementation detail: If s and t are both strings, some Python implementations such as CPython can usually perform an in-place optimization for assignments of the form s = s + t or s += t. When applicable, this optimization makes quadratic run-time much less likely. This optimization is both version and implementation dependent. For performance sensitive code, it is preferable to use the str.join() method which assures consistent linear concatenation performance across versions and implementations.

Your code would look like this:

buf = ''.join(str(random.randint(0, 1000)) for j in range(1000))

And note that since you have not specified a separator it will look like this:

3847018274193258124003837134....

Change '' to ',' if you want the numbers to be (for example) comma separated.

I also don't think you need to buffer yourself as writing to a file should already be buffered.

Upvotes: 3

Joel
Joel

Reputation: 5674

Doing a million of anything is going to be relatively slow. Also, depending on how random you want the numbers, you may want to invest in a more robust random integer generator. This is a personal favorite: http://en.wikipedia.org/wiki/Mersenne_twister

Upvotes: -2

David M&#229;rtensson
David M&#229;rtensson

Reputation: 7600

I am not sure for Python but += is usually an expensive operation as it copies the string to new memory.

Using some string builder or array that you join is probably much faster.

Upvotes: 1

Related Questions