Reputation: 7288
I have a few text files whose sizes range between 5 GB and 50 GB, which I read using Python. I have specific anchor points, given as byte offsets, to which I can seek and from which I read the corresponding data in each of these files (using Python's file API).
The issue I am seeing is that for the relatively smaller files (< 5 GB), this reading approach works well. However, for the much larger files (> 20 GB), and especially when file.seek has to make long jumps (many millions of bytes at a time), it sometimes takes a few hundred milliseconds to do so.
My impression was that seek operations within a file are constant-time. But apparently they are not. Is there a way around this?
Here is what I am doing:
import time

f = open(filename, 'r+b')
f.seek(209)
current = f.tell()

t1 = time.time()
next_offset = f.seek(current + 1200000000)  # jump ~1.2 GB forward
t2 = time.time()
line = f.readline()

delta = t2 - t1
The delta variable varies intermittently between a few microseconds and a few hundred milliseconds. I also profiled the CPU usage and didn't see anything busy there either.
Upvotes: 8
Views: 2842
Reputation: 5092
Your code runs consistently in under 10 microseconds on my system (Windows 10, Python 3.7), so there is no obvious error in your code.
NB: You should use time.perf_counter() instead of time.time() for measuring performance. The granularity of time.time() can be very bad ("not all systems provide time with a better precision than 1 second"), so when comparing timings with other systems you may get strange results.
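A minimal sketch of the suggested timing pattern: time.perf_counter() returns a high-resolution, monotonic clock value, which makes it suitable for measuring short intervals like a single seek (the workload timed here is an arbitrary stand-in):

```python
import time

# perf_counter() uses the highest-resolution clock available and is
# monotonic, so short intervals are measured reliably.
t1 = time.perf_counter()
sum(range(100000))  # placeholder for the operation being timed
t2 = time.perf_counter()
print(f"elapsed: {(t2 - t1) * 1e6:.1f} microseconds")
```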
My best guess is that the seek triggers some buffering (read-ahead) action, which might be slow, depending on your system.
According to the documentation:
"Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device's "block size" and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long."
You could try to disable buffering by passing the argument buffering=0 to open() and check whether that makes a difference:

open(filename, 'r+b', buffering=0)
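To see whether buffering is the culprit, you could time the same seek-and-read on a buffered and an unbuffered handle side by side. This is a self-contained sketch using a small scratch file (the file size and offsets are made up for illustration; your real files are far larger):

```python
import os
import tempfile
import time

# Create a throwaway 4 MB scratch file to stand in for the large file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 4_000_000)
    path = tmp.name

# buffering=-1 means the default buffered mode; buffering=0 disables
# Python's buffering entirely (binary mode only).
for buffering in (-1, 0):
    with open(path, 'r+b', buffering=buffering) as f:
        t1 = time.perf_counter()
        f.seek(3_000_000)    # long jump, as in the question
        data = f.read(64)    # read a small chunk at the target offset
        t2 = time.perf_counter()
        print(f"buffering={buffering}: {(t2 - t1) * 1e6:.1f} us")

os.unlink(path)
```

If the unbuffered handle shows stable timings while the buffered one spikes, the read-ahead buffering is the likely cause.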
Upvotes: 1
Reputation: 435
A good way around that could be to combine the low-level I/O functions from the os module: os.open (with the os.O_RDONLY flag in your case), os.lseek, and os.read.
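A self-contained sketch of that low-level approach: os.open / os.lseek / os.read operate on raw file descriptors and bypass Python's buffered file objects entirely. A small scratch file stands in here for the large file from the question:

```python
import os
import tempfile

# Throwaway scratch file: 1000 filler bytes, a 6-byte marker, more filler.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a" * 1000 + b"TARGET" + b"b" * 1000)
    path = tmp.name

fd = os.open(path, os.O_RDONLY)   # raw file descriptor, read-only
try:
    os.lseek(fd, 1000, os.SEEK_SET)  # jump straight to the byte offset
    chunk = os.read(fd, 6)           # read exactly 6 raw bytes, no buffering
finally:
    os.close(fd)
os.unlink(path)

print(chunk)  # b'TARGET'
```

Because no read-ahead buffer is maintained, a seek here is just an lseek(2) system call on the descriptor.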
Upvotes: 0