Reputation: 1616
I'm investigating file I/O performance in Python 3.6.0. Given this script, which consists of 3 tests:
#!python3
import random, string, time
strs = ''.join(random.choice(string.ascii_lowercase) for i in range(1000000))
strb = bytes(strs, 'latin-1')
inf = open('bench.txt', 'w+b')
inf.write(strb)
for t in range(3):
    inf.seek(0)
    inf.read(8191)
for t in range(3):
    inf.seek(0)
    inf.read(8192)
for t in range(3):
    inf.seek(0)
    inf.read(8193)
inf.close()
Procmon sees the following operations happening (the lines starting with # are my comments):
# Initial write
Offset: 0, Length: 1.000.000
# The 3 8191-long reads only produce one syscall due to caching:
Offset: 0, Length: 8.192
# However, if the read length is exactly 8192, Python doesn't take advantage of the cache:
Offset: 0, Length: 8.192
Offset: 0, Length: 8.192
Offset: 0, Length: 8.192
# Due to caching, the first syscall of the first read of the last loop is missing.
Offset: 8.192, Length: 8.192
Offset: 0, Length: 8.192
Offset: 8.192, Length: 8.192
Offset: 0, Length: 8.192
Offset: 8.192, Length: 8.192
# After that, each 8193-long read produces 2 syscalls.
First off, it is clear that Python reads files in chunks that are multiples of 8 KiB.
I suspect that Python keeps a cache buffer holding the last 8 KiB block it read, and simply returns a cropped copy of that buffer when consecutive reads stay within the same 8 KiB extent.
Can somebody confirm that Python actually implements this mechanism?
If so, this means that Python cannot detect a change made to that block by an external application unless you somehow invalidate the cache manually. Is that correct? Is there perhaps a way to disable this mechanism?
Optionally, why is it that reads of exactly 8192 bytes cannot benefit from the cache?
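For reference, the same pattern can be observed from Python itself, without Procmon, by counting how often the buffered layer calls into the raw file. This is only a sketch: CountingFileIO is a helper I made up for illustration, and it assumes bench.txt from the script above already exists:
import io

class CountingFileIO(io.FileIO):
    # FileIO subclass that counts how often the buffered layer hits the OS
    # (an illustrative helper, not part of the stdlib)
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.raw_reads = 0
    def readinto(self, b):
        self.raw_reads += 1
        return super().readinto(b)

raw = CountingFileIO('bench.txt', 'r')
buf = io.BufferedReader(raw, buffer_size=8192)
for n in (8191, 8192, 8193):
    raw.raw_reads = 0
    for t in range(3):
        buf.seek(0)
        buf.read(n)
    print(n, '->', raw.raw_reads, 'raw reads')
buf.close()
The printed counts should mirror the syscall pattern in the Procmon trace (fewer raw reads for the 8191 case), although the exact numbers can depend on the buffer state carried over between loops.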
Upvotes: 2
Views: 1812
Reputation: 1123620
Yes, the default buffer size is 8k. See io.DEFAULT_BUFFER_SIZE:

io.DEFAULT_BUFFER_SIZE
An int containing the default buffer size used by the module's buffered I/O classes. open() uses the file's blksize (as obtained by os.stat()) if possible.
and
>>> import io
>>> io.DEFAULT_BUFFER_SIZE
8192
and the module source code:
#define DEFAULT_BUFFER_SIZE (8 * 1024) /* bytes */
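As an aside, the blksize mentioned in the quoted docs comes from os.stat(). On platforms that report it (mostly POSIX; it is typically absent on Windows) you can inspect the value open() would prefer; a small sketch, whose output depends on your filesystem:
import io, os

st = os.stat('bench.txt')
# st_blksize is not reported on every platform; open() falls back to
# io.DEFAULT_BUFFER_SIZE when it is unavailable
print(getattr(st, 'st_blksize', io.DEFAULT_BUFFER_SIZE))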
If you use the BufferedIOBase interface or a wrapper to make changes to the file, the buffer will automatically be updated (opening a file in binary mode produces a BufferedIOBase subclass, one of BufferedReader, BufferedWriter or BufferedRandom).
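For example, a write made through the same buffered object is immediately visible to reads through it; a minimal sketch (note it overwrites the first bytes of bench.txt):
with open('bench.txt', 'r+b') as f:   # opens a BufferedRandom
    f.seek(0)
    f.write(b'hello')                 # goes through the shared buffer
    f.seek(0)
    print(f.read(5))                  # b'hello' -- the read sees the write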
For your second case, your seek() call flushes that buffer because you seeked outside of the 'current' block range (the current position was at 8192, the first byte of the second buffered block, and you seeked back to 0, the first byte of the first buffered block). See the source code of BufferedIOBase.seek() for more details.
If you need to edit the underlying file from some other process, using seek() is a great way to ensure the buffer is dropped before trying to read again; alternatively, you can bypass the buffer and go to the underlying RawIOBase implementation via the BufferedIOBase.raw attribute.
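A rough sketch of that second option; note that reading through .raw moves the underlying file position without the buffered layer knowing, so avoid mixing in further buffered reads afterwards without an explicit seek():
with open('bench.txt', 'rb') as f:    # f is a BufferedReader
    f.read(10)                        # buffered: pulls in a full 8 KiB block

    # bypass the buffer: these calls go straight to the OS, so they will
    # see modifications made to the file by another process
    f.raw.seek(0)
    fresh = f.raw.read(10)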
Upvotes: 7