Reputation: 7288
I have a few text files whose sizes range between 5 GB and 50 GB, which I read using Python. I have specific anchor points, given as byte offsets, to which I can seek and from which I read the corresponding data in each of these files (using Python's file API).
The issue I am seeing is that for the relatively smaller files (< 5 GB), this reading approach works well. However, for the much larger files (> 20 GB), and especially when file.seek has to make long jumps (many millions of bytes at a time), it sometimes takes a few hundred milliseconds to do so.
My impression was that seek operations within a file are constant-time. But apparently they are not. Is there a way around this?
Here is what I am doing:
import time

f = open(filename, 'r+b')
f.seek(209)
current = f.tell()

t1 = time.time()
next_offset = f.seek(current + 1200000000)  # jump ~1.2 GB forward
t2 = time.time()
line = f.readline()

delta = t2 - t1
The delta variable varies intermittently between a few microseconds and a few hundred milliseconds. I also profiled the CPU usage and didn't see anything busy there either.
Upvotes: 8
Views: 2842
Reputation: 5092
Your code runs consistently in under 10 microseconds on my system (Windows 10, Python 3.7), so there is no obvious error in your code.
NB: You should use time.perf_counter() instead of time.time() for measuring performance. The granularity of time.time() can be very bad ("not all systems provide time with a better precision than 1 second"), so when comparing timings with other systems you may get strange results.
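A minimal sketch of the suggested timing pattern: time.perf_counter() returns a high-resolution, monotonic clock value, which makes it suitable for measuring short intervals like a single seek (the workload timed here is an arbitrary stand-in):

```python
import time

# perf_counter() uses the highest-resolution clock available and is
# monotonic, so short intervals are measured reliably.
t1 = time.perf_counter()
sum(range(100000))  # placeholder for the operation being timed
t2 = time.perf_counter()
print(f"elapsed: {(t2 - t1) * 1e6:.1f} microseconds")
```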
My best guess is that the seek triggers some buffering (read-ahead) action, which might be slow, depending on your system.
According to the documentation:
"Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device's "block size" and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long."
You could try to disable buffering by passing the argument buffering=0 to open() and check whether that makes a difference:

open(filename, 'r+b', buffering=0)
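To see whether buffering is the culprit, you could time the same seek-and-read on a buffered and an unbuffered handle side by side. This is a self-contained sketch using a small scratch file (the file size and offsets are made up for illustration; your real files are far larger):

```python
import os
import tempfile
import time

# Create a throwaway 4 MB scratch file to stand in for the large file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 4_000_000)
    path = tmp.name

# buffering=-1 means the default buffered mode; buffering=0 disables
# Python's buffering entirely (binary mode only).
for buffering in (-1, 0):
    with open(path, 'r+b', buffering=buffering) as f:
        t1 = time.perf_counter()
        f.seek(3_000_000)    # long jump, as in the question
        data = f.read(64)    # read a small chunk at the target offset
        t2 = time.perf_counter()
        print(f"buffering={buffering}: {(t2 - t1) * 1e6:.1f} us")

os.unlink(path)
```

If the unbuffered handle shows stable timings while the buffered one spikes, the read-ahead buffering is the likely cause.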
Upvotes: 1
Reputation: 435
A good way around that could be to combine the low-level I/O functions from the os module: os.open (with the os.O_RDONLY flag in your case), os.lseek, and os.read.
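A self-contained sketch of that low-level approach: os.open / os.lseek / os.read operate on raw file descriptors and bypass Python's buffered file objects entirely. A small scratch file stands in here for the large file from the question:

```python
import os
import tempfile

# Throwaway scratch file: 1000 filler bytes, a 6-byte marker, more filler.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a" * 1000 + b"TARGET" + b"b" * 1000)
    path = tmp.name

fd = os.open(path, os.O_RDONLY)   # raw file descriptor, read-only
try:
    os.lseek(fd, 1000, os.SEEK_SET)  # jump straight to the byte offset
    chunk = os.read(fd, 6)           # read exactly 6 raw bytes, no buffering
finally:
    os.close(fd)
os.unlink(path)

print(chunk)  # b'TARGET'
```

Because no read-ahead buffer is maintained, a seek here is just an lseek(2) system call on the descriptor.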
Upvotes: 0