noobcoder

Reputation: 83

Does high disk usage mean faster file read/write operations?

I'm writing a Python script in which I read a big file (~5 GB) line by line, make some modifications to each line, and then write it to another file.

When I use file.readlines() to read the input file, my disk usage reaches ~90% and the disk speed exceeds 100 Mbps (I know this method shouldn't be used for large files).

I haven't measured the program execution time for the above case as my system becomes unresponsive (the memory gets full).

When I use an iterator like the one below (and this is what I'm actually using in my code):

with open('file.csv', 'r') as inFile:
    for line in inFile:
        pass  # modify the line and write it to the output file

My disk usage remains < 10%, the speed stays < 5 Mbps, and it takes ~20 minutes for the program to finish for a 5 GB file. Wouldn't this time be lower if my disk usage were high?

Also, does it really take ~20 minutes to read a 5 GB file, process it line by line making some modifications on each line, and finally write it to a new file, or am I doing something wrong?
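
For context, the overall shape of my loop is roughly the following (the output file name and the modification shown here are simplified placeholders):

with open('file.csv', 'r') as inFile, open('out.csv', 'w') as outFile:
    for line in inFile:
        modified = line.replace(';', ',')  # stands in for the real modification
        outFile.write(modified)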

What I can't figure out is why the program doesn't use my system to its full potential when performing the I/O operations. If it did, my disk usage would have been higher, right?

Upvotes: 2

Views: 1065

Answers (4)

Miguel Ortiz

Reputation: 1482

I don't think your main problem is reading the file, because you're using open(). Instead, I would check what you are doing here:

make some modifications in each of the lines, and then write it to another file.

So, try reading the file without making modifications or writing them to another file, to find out how long it takes your system to just read the file.

Here's how I tested in my environment after reading this, this, this and this:

First, I created a 1.2 GB file:

timeout 5 yes "Ergnomic systems for c@ts that works too much" >> foo

I didn't use dd or truncate, as that would lead to MemoryErrors while reading the files (the generated content contains no newlines, so the whole file would come back as one huge line).

Now some I/O testing, reading the file; this is an already optimized operation, as @Serge Ballesta mentioned:

#!/usr/bin/python
with open('foo') as infile:
    for line in infile:
        pass
    print 'file read'

$ time python io_test.py
file read

real    0m2.647s
user    0m2.343s
sys     0m0.327s

Changing buffering options with open():

# --------------------------------------NO BUFFERING
with open('foo','r',0) as infile:
    for line in infile:
        pass
    print 'file read'

$ time python io_test.py
file read

real    0m2.787s
user    0m2.406s
sys     0m0.374s

# --------------------------------------ONE LINE BUFFERED
with open('foo','r',1) as infile:
    for line in infile:
        pass
    print 'file read'

$ time python io_test.py
file read

real    0m4.331s
user    0m2.468s
sys     0m1.811s
# -------------------------------------- 700 MB BUFFER
with open('foo','r',700000000) as infile:
    for line in infile:
        pass
    print 'file read'

$ time python io_test.py
file read

real    0m3.137s
user    0m2.311s
sys     0m0.827s

Why you should not use readlines:

with open('foo') as f:
    lines = f.readlines()
    for line in lines:
        pass

$ time python io_test.py

real    0m6.428s
user    0m3.858s
sys     0m2.499s

Upvotes: 1

Serge Ballesta

Reputation: 148965

Reading a file by line in Python is already an optimized operation: Python loads an internal buffer from the disk and hands it to the caller line by line. That means the line splitting is already done in memory by the Python interpreter.
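
A quick way to see that buffering layer for yourself (using the asker's file.csv as a placeholder name):

import io

# The line iterator sits on top of a buffered reader: Python pulls big
# chunks from the disk (io.DEFAULT_BUFFER_SIZE bytes by default, usually
# 8192) and splits them into lines in memory.
print(io.DEFAULT_BUFFER_SIZE)

with io.open('file.csv', 'rb') as f:
    print(type(f))        # io.BufferedReader: the buffering layer
    print(len(f.peek()))  # a single raw read fills the buffer with a whole chunk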

Normally, processing can be disk IO bound, when disk access is the limiting factor, memory bound, or processor bound. If some network is involved, it can also be network IO bound or remote-server bound, again depending on what the limiting factor is. As you process the file line by line, it is quite unlikely for the process to be memory bound. To find out whether disk IO is the limiting part, you could try to simply copy the file with the system copy utility. If that also takes about 20 minutes, then the process is disk IO bound; if it is much quicker, then the modification of the lines cannot be neglected.
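
If you want to run that comparison from Python itself, a rough sketch could look like this (shutil is not exactly the system copy utility, and the file names are placeholders, but it gives a comparable baseline):

import shutil
import time

# Time a plain copy of the input file: this is close to pure disk IO.
start = time.time()
shutil.copyfile('file.csv', 'file_copy.csv')
print('raw copy took %.1f seconds' % (time.time() - start))

# If this is far below the ~20 minutes of the full script, the per-line
# processing (not the disk) is the limiting factor.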

Anyway, loading a big file in memory is always a bad idea...

Upvotes: 1

LuI

Reputation: 1

You not only need RAM for the file, but also for the input and output buffers and a second copy of your modified file. That can easily overwhelm your RAM. If you do not want to read, modify, and write every single line inside a for loop, you may want to group some lines together. That will probably make reading/writing faster, but at the cost of some extra algorithmic overhead. At the end of the day I'd use the line-by-line approach. HTH! LuI
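
A minimal sketch of the "group some lines together" idea, assuming a hypothetical modify() function and placeholder file names:

from itertools import islice

def modify(line):
    # Stand-in for the real per-line modification.
    return line.upper()

CHUNK = 100000  # number of lines per batch

with open('file.csv', 'r') as inFile, open('out.csv', 'w') as outFile:
    while True:
        batch = list(islice(inFile, CHUNK))  # read a block of lines at once
        if not batch:
            break
        outFile.writelines(modify(line) for line in batch)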

Upvotes: 0

Appyx

Reputation: 1185

It simply depends on the size of the buffer you use for reading the file.

Let's look at an example:

You have a file which contains 20 characters.

Your buffer size is 2 characters.

Then you need at least 10 system calls to read the entire file.

A system call is a very expensive operation because the kernel has to switch the execution context.

If you have a buffer which is 20 characters in size, you need just 1 system call, and therefore only one kernel trap is necessary.

I assume that the first function simply uses a bigger buffer internally.
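
A small sketch that makes the system-call count visible; the 20-character file and the helper function here are made up for illustration:

import os

# Create the 20-character file from the example above.
with open('tiny.txt', 'w') as f:
    f.write('x' * 20)

def count_reads(path, bufsize):
    # Count how many non-empty read() system calls are needed to
    # consume the whole file with a given buffer size.
    calls = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while os.read(fd, bufsize):
            calls += 1
    finally:
        os.close(fd)
    return calls

print(count_reads('tiny.txt', 2))   # 10 calls with a 2-byte buffer
print(count_reads('tiny.txt', 20))  # 1 call with a 20-byte buffer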

Upvotes: 0
