Reputation: 1119
I have a task to create a big file with random data. I started with the following code:
from __future__ import print_function
import random
import string

N = 10
rand_file = open("file_name", 'w')
for i in range(1, 7000000):
    print(''.join(random.choice(string.ascii_lowercase)
                  for x in range(N)),
          file=rand_file)
Looking at the write throughput I get on disk with this program, I feel this is not the fastest way. I would like to create a 100 MB contiguous buffer, write the strings into the buffer, and flush it to the file each time the buffer fills up. How can I do this in Python? I looked at io.BufferedWriter, but could not understand how to use it to write into a file.
Any suggestions are welcome. Thanks.
Upvotes: 3
Views: 10719
Reputation: 419
Python isn't necessarily the easiest or the fastest way to create a large file with random data. The following bash snippet creates a file of the specified length filled with random data; adjust bs and count for the size you need (e.g. count=100 for 100 MB). See this question for the source. Note that /dev/random can block when the entropy pool runs low; /dev/urandom does not.
dd if=/dev/random iflag=fullblock of=$HOME/randomFile bs=1M count=1
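If staying in Python is preferred, os.urandom gives a rough equivalent (a sketch, assuming raw random bytes are acceptable rather than lowercase letters; the filename is arbitrary):

```python
import os

# write 100 MiB of random bytes, 1 MiB at a time
with open('randomFile', 'wb') as f:
    for _ in range(100):
        f.write(os.urandom(1024 * 1024))
```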
Upvotes: -3
Reputation: 27565
Importing ascii_lowercase and choice directly, instead of importing the string and random modules, reduces execution time. Using the with statement to open the file seems to cause a slight increase in execution time.
Instead of writing the lines one at a time (in my code I used 7000 lines rather than 7000000), the idea of the last versions is to group a number of lines into one string, joined by \n, before printing that string to the file. Doing so lowers the number of calls to print().
To obtain the same total number of lines when the number of grouped lines is not a divisor of the total number of lines, some slightly tricky computation is needed in the for-loop and the xrange (easier to understand by reading the code).
I also chose the buffer size so that it equals the number of bits in the string grouping several lines, while being a multiple of 1024.
Each line of the file must contain 10 characters. The grouped lines are joined with \n, which makes 11 characters per line. The last grouped line has no \n after it, but print() adds one when it writes the string.
So for n grouped lines, the grouping string holds n * 11 characters. At 8 bits per character, that makes n * 11 * 8 = n * 88. Finding n is then easy: it must satisfy n * 88 == buffer_size. We just have to pick a buffer_size that is a multiple of 1024 and a multiple of 88 at the same time.
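The smallest size satisfying both constraints can be found with a simple loop (a small sketch of the arithmetic above, not part of the original benchmark):

```python
# step through multiples of 88 until one is also a multiple of 1024
size = 88
while size % 1024 != 0:
    size += 88
print(size)  # 11264, the smallest common multiple of 88 and 1024
```

Any multiple of 11264 (such as the 22528 used below) also works.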
It appears that trying to adjust the buffer's size doesn't provide a benefit; it's even the contrary!
from __future__ import print_function
from time import clock
from os.path import getsize

N = 10
A, B, C, D, E, F = [], [], [], [], [], []
repet = 20
total_lines = 7000

# 1 - module imports, explicit close()
import random
import string
for i in xrange(repet):
    te = clock()
    rand_file1 = open("file_name1", 'w')
    for i in range(total_lines):
        print(''.join(random.choice(string.ascii_lowercase)
                      for x in range(N)),
              file=rand_file1)
    rand_file1.close()
    A.append(clock() - te)

# 2 - same, but using the with statement
import random
import string
for i in xrange(repet):
    te = clock()
    with open("file_name2", 'w') as rand_file2:
        for i in range(total_lines):
            print(''.join(random.choice(string.ascii_lowercase)
                          for x in range(N)),
                  file=rand_file2)
    B.append(clock() - te)

# 3 - ascii_lowercase imported directly
import random
from string import ascii_lowercase
for i in xrange(repet):
    te = clock()
    rand_file3 = open("file_name3", 'w')
    for i in range(total_lines):
        print(''.join(random.choice(ascii_lowercase)
                      for x in range(N)),
              file=rand_file3)
    rand_file3.close()
    C.append(clock() - te)

# 4 - choice and ascii_lowercase imported directly
from random import choice
from string import ascii_lowercase
for i in xrange(repet):
    te = clock()
    rand_file4 = open("file_name4", 'w')
    for i in range(total_lines):
        print(''.join(choice(ascii_lowercase)
                      for x in range(N)),
              file=rand_file4)
    rand_file4.close()
    D.append(clock() - te)

# 5 - grouped lines, default buffer size
from random import choice
from string import ascii_lowercase
buffer_size = 22528
grouped_lines = buffer_size / (11 * 8)
for i in xrange(repet):
    te = clock()
    rand_file5 = open("file_name5", 'w')  # <== no buffer size adjusted here
    for i in range(0, total_lines, grouped_lines):
        u = '\n'.join(''.join(choice(ascii_lowercase)
                              for x in range(N))
                      for y in xrange(min(grouped_lines, total_lines - i)))
        print(u, file=rand_file5)
    rand_file5.close()
    E.append(clock() - te)

# 6 - grouped lines, buffer size passed to open()
from random import choice
from string import ascii_lowercase
buffer_size = 22528
grouped_lines = buffer_size / (11 * 8)
for i in xrange(repet):
    te = clock()
    rand_file6 = open("file_name6", 'w', buffer_size)
    for i in range(0, total_lines, grouped_lines):
        u = '\n'.join(''.join(choice(ascii_lowercase)
                              for x in range(N))
                      for y in xrange(min(grouped_lines, total_lines - i)))
        print(u, file=rand_file6)
    rand_file6.close()
    F.append(clock() - te)

t1, t2, t3, t4, t5, t6 = map(min, (A, B, C, D, E, F))
print('1 %s\n'
      '2 %s %.3f %%\n'
      '3 %s %.3f %%\n'
      '4 %s %.3f %%\n'
      '5 %s %.3f %%\n'
      '6 %s %.3f %%\n'
      % (t1,
         t2, t2 / t1 * 100,
         t3, t3 / t1 * 100,
         t4, t4 / t1 * 100,
         t5, t5 / t1 * 100,
         t6, t6 / t1 * 100))

for y in xrange(880, 100000, 88):
    if y % 1024 == 0:
        print('%d %% 88 == %d    %d %% 1024 == %d'
              % (y, y % 88, y, y % 1024))

print("\nfile_name1", getsize('file_name1'))
for fn in ("file_name2", "file_name3",
           "file_name4", "file_name5",
           "file_name6"):
    print(fn, getsize(fn))
Result:
1 0.492455605391
2 0.503463149646 102.235 %
3 0.475755717556 96.609 %
4 0.449807168229 91.340 %
5 0.319271024669 64.832 %
6 0.334138277351 67.851 %
11264 % 88 == 0 11264 % 1024 == 0
22528 % 88 == 0 22528 % 1024 == 0
33792 % 88 == 0 33792 % 1024 == 0
45056 % 88 == 0 45056 % 1024 == 0
56320 % 88 == 0 56320 % 1024 == 0
67584 % 88 == 0 67584 % 1024 == 0
78848 % 88 == 0 78848 % 1024 == 0
90112 % 88 == 0 90112 % 1024 == 0
file_name1 84000
file_name2 84000
file_name3 84000
file_name4 84000
file_name5 84000
file_name6 84000
Upvotes: 0
Reputation: 77337
You can increase the file's buffer size. By default, it's only 8 KB and gets flushed a lot.
import random
import time
import string

N = 10
count = 0
start = time.time()
with open('/tmp/xyz', 'wb', 100 * (2**20)) as f:
    for i in xrange(1, 7000000):
        s = ''.join(random.choice(string.ascii_lowercase) for x in range(N))
        count += len(s)
        f.write(s)
delta = time.time() - start
print count / (2**20), 'mb', count / (delta * (2**20)), 'mbs'
This helps you get large contiguous writes which is generally a good thing but won't help your performance all that much. Try keeping the random.choice() calculation, but leave out the printing in your code - it will still take a long time. You are CPU bound, not IO bound.
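One quick way to check the CPU-bound claim is to time the string generation with and without the disk write (a small sketch with a much smaller line count; the temp-file name is arbitrary):

```python
import os
import random
import string
import tempfile
import time

N = 10
lines = 50000  # far fewer than the question's 7,000,000, just for a quick comparison

# time generation alone, discarding the strings
start = time.time()
for _ in range(lines):
    ''.join(random.choice(string.ascii_lowercase) for _ in range(N))
gen_only = time.time() - start

# time generation plus writing to disk
path = os.path.join(tempfile.gettempdir(), 'rand_check.txt')
start = time.time()
with open(path, 'w') as f:
    for _ in range(lines):
        f.write(''.join(random.choice(string.ascii_lowercase)
                        for _ in range(N)) + '\n')
gen_and_write = time.time() - start

# if generation dominates, the two timings are of the same order
print(gen_only, gen_and_write)
```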
Upvotes: 2
Reputation: 186
For what it's worth, here is an example of using a BufferedWriter for this:
import io
import random
import string

N = 10
rand_file = io.FileIO("file_name", 'w')
writer = io.BufferedWriter(rand_file, buffer_size=100000000)
for i in range(1, 7000000):
    writer.write(''.join(random.choice(string.ascii_lowercase)
                         for x in range(N)) + '\n')  # newline, to match print()
writer.flush()
Upvotes: 3
Reputation: 20126
Try this to create a big file, then it should be fast to write to it:
import random

N = 2**20
f = open('rand.txt', 'wb')
f.seek(N - 1)
f.write('\0')
f.seek(0)
for i in xrange(N - 1):
    f.write(chr(random.randint(32, 127)))
f.close()
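The seek-then-write step by itself already sets the file's full logical size (on most filesystems it creates a sparse file) before any random data is written. A minimal sketch of just that step (the filename is arbitrary; b'\0' is used so it also runs under Python 3):

```python
import os

N = 2**20
with open('rand_prealloc.txt', 'wb') as f:
    f.seek(N - 1)    # jump to where the last byte should be
    f.write(b'\0')   # writing one byte extends the file to N bytes
print(os.path.getsize('rand_prealloc.txt'))
```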
Upvotes: 0