user2464424

Reputation: 1616

Does Python have an 8 KiB file I/O cache?

I'm investigating file I/O performance in Python 3.6.0. Consider this script, which consists of 3 tests:

#!python3

import random, string, time

strs = ''.join(random.choice(string.ascii_lowercase) for i in range(1000000))
strb = bytes(strs, 'latin-1')

inf = open('bench.txt', 'w+b')
inf.write(strb)

for t in range(3):
    inf.seek(0)
    inf.read(8191)

for t in range(3):
    inf.seek(0)
    inf.read(8192)

for t in range(3):
    inf.seek(0)
    inf.read(8193)

inf.close()

Procmon sees the following operations happening (the lines starting with # are my comments):

  # Initial write
Offset: 0, Length: 1.000.000
  # The three 8191-byte reads produce only one syscall, thanks to caching:
Offset: 0, Length: 8.192
  # However, if the read length is exactly 8192, Python doesn't reuse the cache:
Offset: 0, Length: 8.192
Offset: 0, Length: 8.192
Offset: 0, Length: 8.192
  # The 8193-byte reads each span two 8 KiB blocks, so they need two syscalls;
  # the very first syscall is missing because block 0 is still in the cache:
Offset: 8.192, Length: 8.192
Offset: 0, Length: 8.192
Offset: 8.192, Length: 8.192
Offset: 0, Length: 8.192
Offset: 8.192, Length: 8.192

First off, it is clear that Python reads files in chunks that are multiples of 8 KiB.

I suspect that Python keeps a cache buffer holding the last 8 KiB block it read, and simply returns a slice of it when consecutive reads fall entirely within that same 8 KiB extent.

Can somebody confirm that Python actually implements this mechanism?

If so, that means Python cannot detect a change made to that block by an external application unless you somehow invalidate the cache manually. Is that correct? And is there a way to disable this mechanism?

Optionally: why is it that reads of exactly 8192 bytes cannot benefit from the cache?

Upvotes: 2

Views: 1812

Answers (1)

Martijn Pieters

Reputation: 1123620

Yes, the default buffer size is 8 KiB. See io.DEFAULT_BUFFER_SIZE:

io.DEFAULT_BUFFER_SIZE
An int containing the default buffer size used by the module’s buffered I/O classes. open() uses the file’s blksize (as obtained by os.stat()) if possible.

and

>>> import io
>>> io.DEFAULT_BUFFER_SIZE
8192

and the module source code:

#define DEFAULT_BUFFER_SIZE (8 * 1024)  /* bytes */
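As the quoted docs note, open() actually prefers the file's st_blksize when the platform reports one, and only falls back to this default otherwise. A quick sketch to compare the two on your own system (not part of the original answer):

```python
import io
import os

# The documented fallback used by the buffered I/O classes.
print(io.DEFAULT_BUFFER_SIZE)  # 8192

# st_blksize is the preferred I/O block size for a given file, where the
# platform reports one (it is absent on some systems, e.g. Windows).
st = os.stat('.')
print(getattr(st, 'st_blksize', 'not reported on this platform'))
```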

If you make changes to the file through the BufferedIOBase interface (or a wrapper around it), the buffer is updated automatically; opening a file in binary mode produces a BufferedIOBase subclass, one of BufferedReader, BufferedWriter or BufferedRandom.
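You can check which buffered class each binary mode gives you (an illustrative sketch; the temp-file path is arbitrary):

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.bin')

with open(path, 'wb') as f:        # write-only binary -> BufferedWriter
    assert isinstance(f, io.BufferedWriter)
    f.write(b'some data')

with open(path, 'rb') as f:        # read-only binary -> BufferedReader
    assert isinstance(f, io.BufferedReader)

with open(path, 'w+b') as f:       # read/write binary (as in the question) -> BufferedRandom
    assert isinstance(f, io.BufferedRandom)
```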

For your second case, the seek() call drops the buffer because you sought outside the 'current' buffered block: after reading exactly 8192 bytes you have consumed the whole buffer, so the position (8192, the first byte of the second block) leaves no buffered data behind it, and seeking back to 0 (the first byte of the first block) cannot be satisfied from the buffer. See the source code of BufferedIOBase.seek() for the details.

If the underlying file can be edited by some other process, calling seek() is a simple way to make sure the buffer is dropped before you read again, or you can bypass the buffer entirely and go to the underlying RawIOBase implementation via the BufferedIOBase.raw attribute.
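A minimal sketch of the two buffer-bypassing options (the file name is illustrative; note that buffering=0 is only allowed in binary mode):

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'shared.bin')
with open(path, 'wb') as f:
    f.write(b'x' * 20000)

# Option 1: read through the buffered object's underlying raw file.
buffered = open(path, 'rb')
buffered.raw.seek(0)
first = buffered.raw.read(10)   # goes straight to the OS, no 8 KiB buffer
buffered.close()

# Option 2: open the file unbuffered in the first place.
raw = open(path, 'rb', buffering=0)
assert isinstance(raw, io.RawIOBase)   # a FileIO object
second = raw.read(10)
raw.close()
```

Be careful with option 1: mixing raw reads with buffered reads on the same object can leave the buffered object's idea of the file position out of sync with the raw file's.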

Upvotes: 7
