Reputation: 4718
I'm trying to better understand how numpy's memmap handles views of very large files. The script below opens a memory-mapped 2048^3 array and copies a downsampled 128^3 view of it:
import numpy as np
from time import time
FILE = '/Volumes/BlackBox/test.dat'
array = np.memmap(FILE, mode='r', shape=(2048,2048,2048), dtype=np.float64)
t = time()
for i in range(5):
    view = np.array(array[::16, ::16, ::16])
t = ((time() - t) / 5) * 1000  # average time per copy, in ms
print("Time (ms): %i" % t)
Usually, this prints Time (ms): 80
or so. However, if I change the view assignment to
view = np.array(array[1::16, 2::16, 3::16])
and run it three times, I get the following:
Time (ms): 9988
Time (ms): 79
Time (ms): 78
Does anybody understand why the first invocation is so much slower?
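For what it's worth, a per-iteration variant of the same loop (just a sketch, reusing the setup above) would separate the first pass from the later ones and show whether the cost is a one-time effect:
import numpy as np
from time import time

FILE = '/Volumes/BlackBox/test.dat'
array = np.memmap(FILE, mode='r', shape=(2048, 2048, 2048), dtype=np.float64)

# Time each pass separately instead of averaging all five.
for i in range(5):
    t = time()
    view = np.array(array[1::16, 2::16, 3::16])
    print("Pass %d: %i ms" % (i, (time() - t) * 1000))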
Upvotes: 5
Views: 3590
Reputation: 142176
The OS still has portions (or all) of the mapped file cached in physical RAM. The initial read has to go to disk, which is a lot slower than reading from RAM. Do enough other disk I/O and the timing will climb back toward that first figure, as the OS has to re-read from disk the parts it has since evicted from its cache.
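One way to check this (a sketch; it assumes Python 3 on Linux, since os.posix_fadvise isn't available on macOS) is to evict the file's pages from the page cache between timings and watch the slow time return:
import os
import numpy as np
from time import time

FILE = '/Volumes/BlackBox/test.dat'  # the asker's path; adjust as needed

def timed_copy():
    array = np.memmap(FILE, mode='r', shape=(2048, 2048, 2048), dtype=np.float64)
    t = time()
    view = np.array(array[1::16, 2::16, 3::16])
    return (time() - t) * 1000

print("First copy:  %i ms" % timed_copy())  # cold or warm, depending on history
print("Second copy: %i ms" % timed_copy())  # warm: served from the page cache

# Ask the kernel to drop this file's cached pages (Linux only).
fd = os.open(FILE, os.O_RDONLY)
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
os.close(fd)

print("After eviction: %i ms" % timed_copy())  # slow again: reads hit the disk
On macOS, running the purge command between runs flushes the filesystem cache and should show a similar pattern.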
Upvotes: 5