Reputation: 83
I'm writing a Python script in which I read a big file (~5 GB) line by line, make some modifications to each line, and then write it to another file.
When I use file.readlines() to read the input file, my disk usage reaches ~90% and the disk speed reaches 100+ Mbps (I know this method shouldn't be used for large files).
I haven't measured the program's execution time for that case, because my system becomes unresponsive (the memory gets full).
When I iterate over the file like below (and this is what I'm actually using in my code):
with open('file.csv', 'r') as inFile:
    for line in inFile:
        pass  # modify the line and write it to the output file
My disk usage remains < 10%, the speed is < 5 Mbps, and it takes ~20 minutes for the program to finish for the 5 GB file. Wouldn't this time be lower if my disk usage were higher?
Also, does it really take ~20 minutes to read a 5 GB file, process it line by line making some modifications to each line, and finally write it to a new file, or am I doing something wrong?
What I can't figure out is why the program doesn't use my system to its full potential when performing the I/O operations. If it did, my disk usage should be higher, right?
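For reference, the full pattern looks roughly like this; the file names and modify_line() are just placeholders for the real logic:

def modify_line(line):
    # stand-in for the actual modification
    return line.upper()

with open('file.csv', 'r') as in_file, open('out.csv', 'w') as out_file:
    for line in in_file:    # the file object reads buffered chunks and yields one line at a time
        out_file.write(modify_line(line))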
Upvotes: 2
Views: 1065
Reputation: 1482
I don't think your main problem is reading the file, since you're iterating over the open() file object; instead, I would check what you are doing here:
make some modifications to each line, and then write it to another file.
So, try reading the file without making modifications or writing them to another file, to find out how long it takes your system just to read it.
Here's how I tested it in my environment, after reading a few related posts:
First, created a 1.2GB file:
timeout 5 yes "Ergnomic systems for c@ts that works too much" >> foo
I didn't use dd or truncate; the files they produce have no newlines, which would lead to a MemoryError while reading them line by line.
Now some I/O testing, reading the file; this is an already-optimized operation, as @Serge Ballesta mentioned:
#!/usr/bin/python
with open('foo') as infile:
    for line in infile:
        pass
print 'file readed'
$ time python io_test.py
file readed
real 0m2.647s
user 0m2.343s
sys 0m0.327s
Changing buffering options with open():
# -------------------------------------- NO BUFFERING
with open('foo','r',0) as infile:
    for line in infile:
        pass
print 'file readed'
$ time python io_test.py
file readed
real 0m2.787s
user 0m2.406s
sys 0m0.374s
# -------------------------------------- ONE LINE BUFFERED
with open('foo','r',1) as infile:
    for line in infile:
        pass
print 'file readed'
$ time python io_test.py
file readed
real 0m4.331s
user 0m2.468s
sys 0m1.811s
# -------------------------------------- ~700 MB BUFFER
with open('foo','r',700000000) as infile:
    for line in infile:
        pass
print 'file readed'
$ time python io_test.py
file readed
real 0m3.137s
user 0m2.311s
sys 0m0.827s
Why you should not use readlines:
with open('foo') as f:
    lines = f.readlines()
    for line in lines:
        pass
$ time python io_test.py
real 0m6.428s
user 0m3.858s
sys 0m2.499s
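As a rough way to see the memory side of this as well (not part of the timings above; the resource module is Unix-only and ru_maxrss is reported in kilobytes on Linux), you could print the peak RSS after readlines() on the same foo file:

# Sketch: peak memory after readlines() -- the whole file is
# materialized as a list of strings before the loop even starts.
import resource

with open('foo') as f:
    lines = f.readlines()
for line in lines:
    pass
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)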
Upvotes: 1
Reputation: 148965
Reading a file line by line in Python is already an optimized operation: Python loads an internal buffer from the disk and hands it out to the caller line by line. That means the line splitting is already done in memory by the Python interpreter.
Normally, a process can be disk IO bound (when disk access is the limiting factor), memory bound, or processor bound; if a network is involved, it can also be network IO bound or remote-server bound, again depending on what the limiting factor is. As you process the file line by line, it is quite unlikely for the process to be memory bound. To find out whether disk IO is the limiting part, you could try to simply copy the file with the system copy utility. If that also takes about 20 minutes, the process is disk IO bound; if it is much quicker, then the modification of the lines cannot be neglected.
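As a rough sketch of that comparison done from Python itself (shutil.copyfile is just one way to copy; the file names are placeholders), you could time a plain copy and set it against your 20-minute processing run:

# Sketch: time a plain copy of the input file to get a baseline for
# raw disk throughput, then compare it with the processing time.
import shutil
import time

start = time.time()
shutil.copyfile('file.csv', 'file_copy.csv')
print('copy took %.1f seconds' % (time.time() - start))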
Anyway, loading a big file in memory is always a bad idea...
Upvotes: 1
Reputation: 1
You not only need RAM for the file, but also for input and output buffers and a second copy of your modified file. That easily overwhelms your RAM. If you do not want to read, modify, and write each single line in a for loop, you may want to group some lines together. This will probably make reading/writing faster, but at the cost of some more algorithmic overhead. At the end of the day I'd use the line-by-line approach. HTH! LuI
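A minimal sketch of that grouping idea, assuming a placeholder modify() function and an arbitrary chunk size of 10,000 lines:

# Sketch: buffer modified lines and write them out in chunks instead
# of issuing one write() call per line. modify() is a placeholder.
def modify(line):
    return line

CHUNK = 10000                     # lines per write
with open('file.csv') as in_file, open('out.csv', 'w') as out_file:
    buf = []
    for line in in_file:
        buf.append(modify(line))
        if len(buf) >= CHUNK:
            out_file.writelines(buf)
            buf = []
    if buf:                       # flush whatever is left at the end
        out_file.writelines(buf)

writelines() saves the per-call overhead of many small writes, at the cost of holding CHUNK lines in memory at once.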
Upvotes: 0
Reputation: 1185
It simply depends on the size of the buffer you use for reading the file.
Let's look at an example:
You have a file which contains 20 characters.
Your buffer size is 2 characters.
Then you have to use at least 10 system calls for reading the entire file.
A system call is a very expensive operation because the kernel has to switch the execution context.
If you have a buffer which is 20 characters in size, you just need 1 system call and therefore only one kernel trap is necessary.
I assume that readlines() simply uses a bigger buffer internally.
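As a rough illustration (the low-level os.read() is used here just to make the call count visible; 'foo' is a placeholder path and 1 MB is an arbitrary larger buffer):

# Sketch: read a file with raw os.read() and count how many calls --
# and therefore kernel traps -- each buffer size needs.
import os

def count_reads(path, bufsize):
    fd = os.open(path, os.O_RDONLY)
    calls = 0
    try:
        while True:
            chunk = os.read(fd, bufsize)
            calls += 1
            if not chunk:
                break
    finally:
        os.close(fd)
    return calls

print(count_reads('foo', 2))            # tiny buffer -> many system calls
print(count_reads('foo', 1024 * 1024))  # 1 MB buffer -> very few system calls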
Upvotes: 0