0xhacker

Reputation: 1119

How to make writing to a file faster in python?

I have a task to create a big file with random data. I started with the following code:

from __future__ import print_function
import random
import string

N = 10
rand_file = open("file_name", 'w')

for i in range(1, 7000000):
  print(''.join(random.choice(string.ascii_lowercase)
                for x in range(N)),
        file=rand_file)

Looking at the write throughput I get on disk with this program, I feel this is not the fastest way. I would like to create a 100 MB contiguous buffer, write the strings into that buffer, and flush it to the file every time it fills up. How can I do this in Python? I looked at io.BufferedWriter, but could not understand how to use it to write to a file.
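For illustration, this is roughly the kind of manual buffering I have in mind (only a sketch; the 100 MB threshold and the list-based buffer are just what I am imagining, not tested code):

import random
import string

N = 10
BUF_LIMIT = 100 * 1024 * 1024   # assumed 100 MB threshold

buf = []
buf_len = 0
with open("file_name", 'w') as rand_file:
    for i in range(7000000):
        line = ''.join(random.choice(string.ascii_lowercase) for x in range(N)) + '\n'
        buf.append(line)
        buf_len += len(line)
        if buf_len >= BUF_LIMIT:   # flush the buffer once it fills up
            rand_file.write(''.join(buf))
            buf = []
            buf_len = 0
    if buf:                        # write whatever is left at the end
        rand_file.write(''.join(buf))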

Any suggestions are welcome. Thanks.

Upvotes: 3

Views: 10719

Answers (5)

Daryl

Reputation: 419

Python isn't necessarily the easiest or the fastest way to create a large file with random data. The following bash snippet creates a file of the specified size filled with random data (adjust bs and count for the size you want); see this question for the source.

dd if=/dev/random iflag=fullblock of=$HOME/randomFile bs=1M count=1
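If you prefer to stay in Python, a rough equivalent is to write large chunks of random bytes from os.urandom (a sketch, assuming raw random bytes rather than lowercase letters are acceptable; the chunk and total sizes are arbitrary):

import os

CHUNK = 1024 * 1024      # write 1 MiB per call
TOTAL = 100 * CHUNK      # target size, here about 100 MB

with open("randomFile", "wb") as f:
    written = 0
    while written < TOTAL:
        f.write(os.urandom(CHUNK))
        written += CHUNK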

Upvotes: -3

eyquem

Reputation: 27565

Importing ascii_lowercase and choice directly, instead of the string and random modules, reduces execution time.
It seems that using the with statement to open the file causes a slight increase in execution time.

Instead of writing the lines one at a time (in my code I used 7000 lines instead of 7000000), the idea of the last two variants is to group a number of lines into one string, joined with \n, before printing that string to the file.
Doing so lowers the number of calls to print().

To obtain the same total number of lines when the number of grouped lines isn't a divisor of the total number of lines, some slightly tricky computation is needed in the for loop and in the xrange call (easier to understand by reading the code).

I also chose the buffer size so that it equals the number of bits in the string grouping several lines, while being a multiple of 1024.
Each line of the file must contain 10 characters. The grouped lines are joined with \n, which makes 11 characters per line. The last grouped line doesn't have a \n after it, but print() adds one when it writes the string.
So for n grouped lines there are n * 11 characters in the grouping string. Counting a character as 8 bits, that makes n * 11 * 8 = n * 88. Finding n is then easy: it must satisfy n * 88 = buffer_size. We just have to pick a buffer_size that is a multiple of 1024 and of 88 at the same time.
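The smallest such value is lcm(88, 1024) = 11264, and the 22528 used below is twice that; a quick check (the import fallback is only there so the snippet runs on both Python 2 and 3):

try:
    from math import gcd          # Python 3.5+
except ImportError:
    from fractions import gcd     # Python 2

# lcm(88, 1024) = 88 * 1024 / gcd(88, 1024)
print(88 * 1024 // gcd(88, 1024))   # -> 11264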

EDIT

It appears that trying to adjust the buffer size doesn't provide a benefit; if anything, it's the contrary!

from __future__ import print_function

from time import clock
from os.path import getsize

N=10
A,B,C,D,E,F = [],[],[],[],[],[]
repet = 20
total_lines = 7000

# variant 1: module-level imports, explicit close()

import random
import string
for i in xrange(repet):
  te = clock()
  rand_file1 = open("file_name1", 'w')
  for i in range(total_lines):
    print(''.join(random.choice(string.ascii_lowercase)
                  for x in range(N)),
          file=rand_file1)
  rand_file1.close()
  A.append(clock()-te)

# variant 2: same as 1, but opening the file with a with statement

import random
import string
for i in xrange(repet):
  te = clock()
  with open("file_name2", 'w') as rand_file2:
    for i in range(total_lines):
      print(''.join(random.choice(string.ascii_lowercase)
                    for x in range(N)),
            file=rand_file2)
  B.append(clock()-te)

# variant 3: ascii_lowercase imported directly

import random
from string import ascii_lowercase
for i in xrange(repet):
  te = clock()
  rand_file3 = open("file_name3", 'w')
  for i in range(total_lines):
    print(''.join(random.choice(ascii_lowercase)
                  for x in range(N)),
          file=rand_file3)
  rand_file3.close()
  C.append(clock()-te)

# variant 4: choice and ascii_lowercase imported directly

from random import choice
from string import ascii_lowercase
for i in xrange(repet):
  te = clock()
  rand_file4 = open("file_name4", 'w')
  for i in range(total_lines):
    print(''.join(choice(ascii_lowercase)
                  for x in range(N)),
          file=rand_file4)
  rand_file4.close()
  D.append(clock()-te)

# variant 5: grouped lines, default file buffering

from random import choice
from string import ascii_lowercase
buffer_size = 22528
grouped_lines = buffer_size/(11*8)
for i in xrange(repet):
  te = clock()
  rand_file5 = open("file_name5", 'w') # <== no buffer size set here
  for i in range(0, total_lines, grouped_lines):
    u = '\n'.join(''.join(choice(ascii_lowercase)
                            for x in range(N))
                  for y in xrange(min(grouped_lines,total_lines-i)))
    print(u,file=rand_file5)
  rand_file5.close()
  E.append(clock()-te)

# variant 6: grouped lines, buffer size passed to open()

from random import choice
from string import ascii_lowercase
buffer_size = 22528
grouped_lines = buffer_size/(11*8)
for i in xrange(repet):
  te = clock()
  rand_file6 = open("file_name6", 'w', buffer_size)
  for i in range(0, total_lines, grouped_lines):
    u = '\n'.join(''.join(choice(ascii_lowercase)
                            for x in range(N))
                  for y in xrange(min(grouped_lines,total_lines-i)))
    print(u,file=rand_file6)
  rand_file6.close()
  F.append(clock()-te)

# report the best time of each variant and check the results

t1,t2,t3,t4,t5,t6=map(min,(A,B,C,D,E,F))
print ('1  %s\n'
       '2  %s  %.3f %%\n'
       '3  %s  %.3f %%\n'
       '4  %s  %.3f %%\n'
       '5  %s  %.3f %%\n'
       '6  %s  %.3f %%\n'
       % (t1,
          t2,t2/t1*100,
          t3,t3/t1*100,
          t4,t4/t1*100,
          t5,t5/t1*100,
          t6,t6/t1*100))


for y in xrange(880,100000,88):
  if y%1024==0:
    print('%d %% 88 == %d   %d %% 1024 == %d'
          % (y,y%88,y,y%1024))

print("\nfile_name1",getsize('file_name1'))
for fn in ("file_name2","file_name3",
           "file_name4","file_name5",
           "file_name6"):
  print(fn,getsize(fn))

result

1  0.492455605391
2  0.503463149646  102.235 %
3  0.475755717556  96.609 %
4  0.449807168229  91.340 %
5  0.319271024669  64.832 %
6  0.334138277351  67.851 %

11264 % 88 == 0   11264 % 1024 == 0
22528 % 88 == 0   22528 % 1024 == 0
33792 % 88 == 0   33792 % 1024 == 0
45056 % 88 == 0   45056 % 1024 == 0
56320 % 88 == 0   56320 % 1024 == 0
67584 % 88 == 0   67584 % 1024 == 0
78848 % 88 == 0   78848 % 1024 == 0
90112 % 88 == 0   90112 % 1024 == 0

file_name1 84000
file_name2 84000
file_name3 84000
file_name4 84000
file_name5 84000
file_name6 84000  

Upvotes: 0

tdelaney

Reputation: 77337

You can increase the file's buffer size. By default, it's only 8 KB and gets flushed a lot.

import random
import time
import string

N = 10
count = 0

start = time.time()
with open('/tmp/xyz','wb',100*(2**20)) as f:
    for i in xrange(1,7000000):
        s = ''.join(random.choice(string.ascii_lowercase) for x in range(N))
        count += len(s)
        f.write(s)
delta = time.time() - start
print count/(2**20), 'mb', count/(delta*(2**20)), 'mbs'

This gives you large contiguous writes, which is generally a good thing, but it won't help your performance all that much. Try keeping the random.choice() calculation but leaving out the write in your code - it will still take a long time. You are CPU bound, not I/O bound.
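A quick way to check that claim is to time the string generation on its own, with the file writing removed; a rough sketch along the lines of the code above:

import random
import string
import time

N = 10
start = time.time()
for i in xrange(1, 7000000):
    # same work as before, but the result is discarded instead of written
    s = ''.join(random.choice(string.ascii_lowercase) for x in range(N))
print 'generation alone took', time.time() - start, 'seconds'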

Upvotes: 2

seanoftime

Reputation: 186

For what it's worth, here is an example of using a BufferedWriter for this:

import io
import random
import string

N=10
rand_file = io.FileIO("file_name", 'w')
writer = io.BufferedWriter(rand_file, buffer_size=100000000)

for i in range(1, 7000000):
  # no newline is appended here, so the output is one long run of letters;
  # add '\n' to the joined string if separate lines are wanted
  writer.write(''.join(random.choice(string.ascii_lowercase) for x in range(N)))

writer.flush()
writer.close()

Upvotes: 3

zenpoy

Reputation: 20126

Try this: it first creates a big file of the full size (by seeking to the end and writing a single byte), after which writing into it should be fast:

import random

N = 2**20

f = open('rand.txt', 'wb')
f.seek(N-1)
f.write('\0')
f.seek(0)

for i in xrange(N-1):
    f.write(chr(random.randint(32,127)))

f.close()

Upvotes: 0
