Reputation: 1063
I need to generate a very large text file. Each line has a simple format:
Seq_num<SPACE>num_val
12343234 759
Let's assume I am going to generate a file with 100 million lines. I tried two approaches, and surprisingly they give very different time performance.
Approach 1: loop over the 100 million items. In each iteration I build a short string of the form seq_num<SPACE>num_val and write it to the file.
This approach takes a lot of time.
## APPROACH 1
for seq_id in seq_ids:
    num_val=rand()
    line=seq_id+' '+num_val
    data_file.write(line)
Approach 2: loop over the 100 million items. In each iteration I build the same short string, but append it to a list instead.
When the loop finishes, I iterate over the list items and write each one to the file.
This approach takes far less time.
## APPROACH 2
data_lines=list()
for seq_id in seq_ids:
    num_val=rand()
    l=seq_id+' '+num_val
    data_lines.append(l)
for line in data_lines:
    data_file.write(line)
Note that approach 2 has two loops instead of one, and it still writes every line to the file, yet it is the faster one. Intuitively, approach 1 should take less time. Any hints on what I am missing?
Upvotes: 13
Views: 9222
Reputation: 3954
Considering APPROACH 2, I think I can assume you have the data for all the lines (or at least in big chunks) before you need to write it to the file.
The other answers are great and it was really informative to read them, but both focused on optimizing the file writing or on replacing the first for loop with a list comprehension (which is known to be faster).
They missed the fact that you are iterating in a for loop to write the file, which is not really necessary.
Instead of doing that, at the cost of more memory (affordable in this case, since a 100-million-line file would be about 600 MB), you can create just one string in a more efficient way using the formatting or join features of Python str, and then write that big string to the file in a single call, relying on a generator expression to produce the data to be formatted.
With loop1 and loop2 of @Tombart's answer, I get elapsed time 0:00:01.028567 and elapsed time 0:00:01.017042, respectively.
While with this code:
import random
from datetime import datetime

start = datetime.now()
data_file = open('file.txt', 'w')
data_lines = ('%i %f\n' % (seq_id, random.random())
              for seq_id in range(0, 1000000))
contents = ''.join(data_lines)
data_file.write(contents)
end = datetime.now()
print("elapsed time %s" % (end - start))
I get elapsed time 0:00:00.722788, which is about 25% faster.
Notice that data_lines is a generator expression, so the list is not really stored in memory; the lines are generated and consumed on demand by the join method. This implies that the only variable significantly occupying memory is contents. This also slightly reduces the running time.
If the text is too large to do all the work in memory, you can always split it into chunks, that is, format the string and write it to the file every million lines or so, as in the sketch below.
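For example, a chunked variant of the same idea could look like this (just a sketch; the chunk size of one million lines and the file name are arbitrary choices of mine):
import random

# build one big string per chunk and write it, so memory use stays bounded
chunk_size = 1000000
total_lines = 100000000

with open('file.txt', 'w') as data_file:
    for chunk_start in range(0, total_lines, chunk_size):
        chunk = ''.join('%i %f\n' % (seq_id, random.random())
                        for seq_id in range(chunk_start, chunk_start + chunk_size))
        data_file.write(chunk)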
Conclusions:

- Where possible, use list comprehensions instead of explicit loops (you can even use filter for filtering lists, see here).
- If possible, try to build big strings using the format or join functions of str.
- Try to avoid for loops. For example, use the extend function of a list instead of iterating and calling append. In fact, both previous points can be seen as examples of this remark.

Remark: Although this answer can be considered useful on its own, it does not completely address the question of why the two-loop option in the question seems to run faster in some environments. For that, perhaps @Aiken Drum's answer below can shed some light on the matter.
Upvotes: 6
Reputation: 5463
Below is an extension to the elegant answer by @Tombart and a few further observations.
With one goal in mind, optimizing the process of generating data in the loop(s) and then writing it to a file, let's begin:
I will use the with statement to open/close the file test.txt in all cases. This statement automatically closes the file once the code block within it has finished executing.
Another important point to consider is the way Python processes text files based on Operating system. From the docs:
Note: Python doesn’t depend on the underlying operating system’s notion of text files; all the processing is done by Python itself, and is therefore platform-independent.
This means that these results should vary only slightly when executed on Linux/Mac or on a Windows OS. Any slight variation may result from other processes using the same file at the same time, multiple IO operations happening on the file during script execution, or general CPU processing speed, among other factors.
I present 3 cases with execution times for each and finally find a way to further optimize the most efficient and quick case:
First case: Loop over range(1,1000000) and write to file
import time
import random
start_time = time.time()
with open('test.txt', 'w') as f:
    for seq_id in range(1, 1000000):
        num_val = random.random()
        line = "%i %f\n" % (seq_id, num_val)
        f.write(line)
print('Execution time: %s seconds' % (time.time() - start_time))
#Execution time: 2.6448447704315186 seconds
Note: In the two list scenarios below, I have initialized an empty list data_lines like [] instead of using list(). The reason is that [] is about 3 times faster than list(). Here's an explanation for this behavior: Why is [] faster than list()?. The main crux of the discussion is: while [] is created by a single bytecode instruction, list() requires name resolution and a global function call, and the stack has to be involved to push arguments.
Using the timeit() function in the timeit module, here's the comparison:
import timeit

timeit.timeit("[]")        #0.030497061136874608
timeit.timeit("list()")    #0.12418613287039193
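The same difference shows up in the bytecode itself; a quick way to see it (a small sketch using the dis module, which is also used later in this answer):
import dis

dis.dis("[]")       # compiles to a single BUILD_LIST instruction (plus the return)
dis.dis("list()")   # needs LOAD_NAME (name resolution) followed by a function call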
Second Case: Loop over range(1,1000000), append values to an empty list and then write to file
import time
import random
start_time = time.time()
data_lines = []
with open('test.txt', 'w') as f:
    for seq_id in range(1, 1000000):
        num_val = random.random()
        line = "%i %f\n" % (seq_id, num_val)
        data_lines.append(line)
    for line in data_lines:
        f.write(line)
print('Execution time: %s seconds' % (time.time() - start_time))
#Execution time: 2.6988046169281006 seconds
Third Case: Loop over a list comprehension and write to file
With Python's powerful and compact list comprehensions, it is possible to optimize the process further:
import time
import random
start_time = time.time()
with open('test.txt', 'w') as f:
    data_lines = ["%i %f\n" % (seq_id, random.random()) for seq_id in range(1, 1000000)]
    for line in data_lines:
        f.write(line)
print('Execution time: %s seconds' % (time.time() - start_time))
#Execution time: 2.464804172515869 seconds
On multiple iterations, I've always received a lower execution time value in this case as compared to the previous two cases.
#Iteration 2: Execution time: 2.496004581451416 seconds
Now the question arises: why are list comprehensions (and lists in general) faster than sequential for loops?
An interesting way to analyze what happens when a sequential for loop executes versus when a list comprehension executes is to disassemble the code object generated by each and examine the contents. Here is an example of a list comprehension code object disassembled:
#disassemble a list code object
import dis
l = "[x for x in range(10)]"
code_obj = compile(l, '<list>', 'exec')
print(code_obj) #<code object <module> at 0x000000058DA45030, file "<list>", line 1>
dis.dis(code_obj)
#Output:
<code object <module> at 0x000000058D5D4C90, file "<list>", line 1>
1 0 LOAD_CONST 0 (<code object <listcomp> at 0x000000058D5D4ED0, file "<list>", line 1>)
2 LOAD_CONST 1 ('<listcomp>')
4 MAKE_FUNCTION 0
6 LOAD_NAME 0 (range)
8 LOAD_CONST 2 (10)
10 CALL_FUNCTION 1
12 GET_ITER
14 CALL_FUNCTION 1
16 POP_TOP
18 LOAD_CONST 3 (None)
20 RETURN_VALUE
Here's an example of a for loop code object disassembled in a function test:
#disassemble a function code object containing a `for` loop
import dis
test_list = []
def test():
    for x in range(1,10):
        test_list.append(x)
code_obj = test.__code__ #get the code object <code object test at 0x000000058DA45420, file "<ipython-input-19-55b41d63256f>", line 4>
dis.dis(code_obj)
#Output:
0 SETUP_LOOP 28 (to 30)
2 LOAD_GLOBAL 0 (range)
4 LOAD_CONST 1 (1)
6 LOAD_CONST 2 (10)
8 CALL_FUNCTION 2
10 GET_ITER
>> 12 FOR_ITER 14 (to 28)
14 STORE_FAST 0 (x)
6 16 LOAD_GLOBAL 1 (test_list)
18 LOAD_ATTR 2 (append)
20 LOAD_FAST 0 (x)
22 CALL_FUNCTION 1
24 POP_TOP
26 JUMP_ABSOLUTE 12
>> 28 POP_BLOCK
>> 30 LOAD_CONST 0 (None)
32 RETURN_VALUE
The above comparison shows more "activity", if I may, in the case of the for loop. For instance, notice the additional function call to the append() method in the for loop version. To know more about the parameters in the dis call output, here's the official documentation.
Finally, as suggested before, I also tested with file.flush(), and the execution time exceeds 11 seconds. I added f.flush() before the file.write() statement:
import os
.
.
.
for line in data_lines:
    f.flush()                #flushes the internal buffer and copies data to the OS buffer
    os.fsync(f.fileno())     #writes the OS buffer for the file descriptor (fd=f.fileno()) to disk
    f.write(line)
The longer execution time using flush() can be attributed to the way data is processed. This function copies the data from the program buffer to the operating system buffer. This means that if a file (say test.txt in this case) is being used by multiple processes and large chunks of data are being added to it, you will not have to wait for the whole data to be written to the file, and the information will be readily available. But to make sure that the buffered data is actually written to disk, you also need to add os.fsync(f.fileno()). Now, adding os.fsync() increases the execution time at least 10 times (I didn't sit through the whole run!), as it involves copying data from the buffer to the disk. For more details, go here.
Further Optimization: It is possible to optimize the process further. There are libraries available that support multithreading, create process pools and perform asynchronous tasks. This is particularly useful when a function performs a CPU-intensive task and writes to a file at the same time. For instance, a combination of threading and list comprehensions gives the fastest result(s):
import time
import random
import threading
start_time = time.time()
def get_seq():
    data_lines = ["%i %f\n" % (seq_id, random.random()) for seq_id in range(1, 1000000)]
    with open('test.txt', 'w') as f:
        for line in data_lines:
            f.write(line)
set_thread = threading.Thread(target=get_seq)
set_thread.start()
print('Execution time: %s seconds' % (time.time() - start_time))
#Execution time: 0.015599966049194336 seconds
Conclusion: List comprehensions offer better performance than sequential for loops and list appends. The primary reason is the single-instruction bytecode execution of list comprehensions, which is faster than the sequential iterative calls to append items to a list in a for loop. There is scope for further optimization using asyncio, threading and ProcessPoolExecutor(); you could also use a combination of these to achieve faster results. Using file.flush() depends on your requirements: you may add it when you need access to the data while the file is still being written and is used by multiple processes, though this can take much longer if you also write the data from the program's buffer to the OS's disk with os.fsync(f.fileno()).
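As an illustration of the ProcessPoolExecutor() idea mentioned above, here is a rough sketch (not the author's measured setup; the chunk boundaries, worker count and the helper name make_chunk are my own assumptions): format the chunks in worker processes and write them sequentially in the main process:
import random
from concurrent.futures import ProcessPoolExecutor

def make_chunk(bounds):
    # format one chunk of lines in a worker process (the CPU-bound part)
    start, stop = bounds
    return ''.join("%i %f\n" % (seq_id, random.random()) for seq_id in range(start, stop))

if __name__ == '__main__':
    # 4 chunks of 250k lines each; chunk size is not tuned
    chunk_bounds = [(i, i + 250000) for i in range(0, 1000000, 250000)]
    with ProcessPoolExecutor() as pool, open('test.txt', 'w') as f:
        for chunk in pool.map(make_chunk, chunk_bounds):
            f.write(chunk)   # writing stays sequential in the main process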
Upvotes: 14
Reputation: 32378
"A lot" and "far less" are technically very vague terms :) Basically, if you can't measure it, you can't improve it.
For simplicity, let's have a simple benchmark, loop1.py:
import random
from datetime import datetime
start = datetime.now()
data_file = open('file.txt', 'w')
for seq_id in range(0, 1000000):
    num_val=random.random()
    line="%i %f\n" % (seq_id, num_val)
    data_file.write(line)
end = datetime.now()
print("elapsed time %s" % (end - start))
loop2.py with two for loops:
import random
from datetime import datetime
start = datetime.now()
data_file = open('file.txt', 'w')
data_lines=list()
for seq_id in range(0, 1000000):
    num_val=random.random()
    line="%i %f\n" % (seq_id, num_val)
    data_lines.append(line)
for line in data_lines:
    data_file.write(line)
end = datetime.now()
print("elapsed time %s" % (end - start))
When I run these two scripts on my computer (with an SSD drive), I'm getting something like:
$ python3 loop1.py
elapsed time 0:00:00.684282
$ python3 loop2.py
elapsed time 0:00:00.766182
Each measurement might be slightly different, but as intuition would suggest, the second one is slightly slower.
If we want to optimize writing time, we need to check the manual for how Python implements writing into files. For text files, the open() function should use a BufferedWriter. The open function accepts an optional third argument, the buffer size. Here's the interesting part:
Pass 0 to switch buffering off (only allowed in binary mode), 1 to select line buffering (only usable in text mode), and an integer > 1 to indicate the size in bytes of a fixed-size chunk buffer. When no buffering argument is given, the default buffering policy works as follows:
Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device’s “block size” and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long.
So, we can modify loop1.py and use line buffering:
data_file = open('file.txt', 'w', 1)
This turns out to be very slow:
$ python3 loop3.py
elapsed time 0:00:02.470757
In order to optimize writing time, we can adjust the buffer size to our needs. First we check the line size in bytes: len(line.encode('utf-8')) gives me 11 bytes.
After updating the buffer size to our expected line size in bytes:
data_file = open('file.txt', 'w', 11)
I'm getting quite fast writes:
elapsed time 0:00:00.669622
Based on the details you've provided it's hard to estimate what's going on. Maybe the heuristic for estimating the block size doesn't work well on your computer. Anyway, if you're writing lines of fixed length, it's easy to optimize the buffer size. You could further optimize writing to files by leveraging flush().
Conclusion: Generally, for faster writes into a file you should try to write a bulk of data that corresponds to a block size on your file system, which is exactly what the Python call open('file.txt', 'w') is trying to do. In most cases you're safe with the defaults; differences in microbenchmarks are insignificant.
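If you want to see the block size your file system actually reports and use it explicitly, something like this works (a sketch; os.statvfs is POSIX-only, so this won't run on Windows):
import os

# ask the file system for its block size and pass it as the buffering argument
block_size = os.statvfs('.').f_bsize   # typically 4096 or larger
data_file = open('file.txt', 'w', buffering=block_size)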
You're allocating a large number of string objects that need to be collected by the GC. As suggested by @kevmo314, in order to perform a fair comparison you should disable the GC for loop1.py:
import gc
gc.disable()
The GC might try to collect string objects while you're iterating over the loop (you're not keeping any references to them), while the second approach keeps references to all string objects and the GC collects them at the end.
Upvotes: 17
Reputation: 593
The other answers here give good advice, but I think the actual problem may be different:
I think the real issue here is the generational garbage collector is running more often with the single-loop code. The generational GC exists alongside the refcounting system, to periodically check for orphaned objects with nonzero self/cyclic-references.
The reason why this would be happening is probably complex, but my best guess is this:
With the single-loop code, each iteration is implicitly allocating a new string, then sending it off to be written to a file, after which it is abandoned, its refcount goes to zero, and thus it is deallocated. I believe cumulative alloc/dealloc traffic is part of the heuristic that decides when GC is done, so this behavior would be sufficient to set that flag every so many iterations. The flag, in turn, is probably checked any time your thread is going to be forced to wait for something anyway, because that's an excellent opportunity to fill wasted time with a garbage collection. Synchronous file writes are exactly that kind of opportunity.
With the dual-loop code, you are creating a string and adding it to the list, over and over, nothing else. Allocate, allocate, allocate. If you run out of memory, you're going to trigger a GC, but otherwise I doubt you're doing anything that's set up to check for opportunities to GC. There's nothing there to cause a thread wait, a context switch, etc. The second loop calls into the synchronous file I/O, where I think opportunistic GC can occur, but only the first call might trigger one, because there is no further memory allocation/deallocation at that point. Only after the entire list is written is the list itself deallocated, all at once.
I'm not in a position to test the theory myself right now, unfortunately, but you could try disabling the generational garbage collection and see whether or not it changes the execution speed of the single-loop version:
import gc
gc.disable()
I think that's all you'd need to do to confirm or disprove my theory.
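Alternatively, instead of (or in addition to) disabling it, you could count how often the generational collector actually fires in each version (a sketch; gc.callbacks is available in Python 3.3+, and the counter and callback names here are mine):
import gc

collection_count = 0

def count_collections(phase, info):
    # the GC calls this before ("start") and after ("stop") each collection
    global collection_count
    if phase == "start":
        collection_count += 1

gc.callbacks.append(count_collections)
# ... run the single-loop or the dual-loop version here ...
print("generational GC ran %d times" % collection_count)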
Upvotes: 2
Reputation: 3318
You could reduce the time cost by around half by changing the following:
for line in data_lines:
    data_file.write(line)
into:
data_file.write('\n'.join(data_lines))
Here is my test run with range(0, 1000000):
elapsed time 0:00:04.653065
elapsed time 0:00:02.471547
2.471547 / 4.653065 = 53 %
However, with 10 times the above range, there is not much difference.
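For reference, a self-contained version of this test could look roughly like the following (a sketch, not the exact script used for the timings above; the line format and range follow the earlier answers):
import random
from datetime import datetime

# lines are built WITHOUT trailing newlines, since '\n'.join supplies the separators;
# if your lines already end in '\n', use ''.join instead
data_lines = ["%i %f" % (seq_id, random.random()) for seq_id in range(0, 1000000)]

start = datetime.now()
with open('file.txt', 'w') as data_file:
    data_file.write('\n'.join(data_lines))
end = datetime.now()
print("elapsed time %s" % (end - start))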
Upvotes: 0