Jee Seok Yoon
Jee Seok Yoon

Reputation: 4806

Numpy memmap better IO and memory usage

Currently I am working with a NumPy memmap array with 2,000,000 * 33 * 33 *4 (N * W * H * C) data. My program reads random (N) indices from this array.

I have 8GB of RAM, 2TB HDD. The HDD read IO is only around 20M/s, RAM usage stays at 2.5GB. It seems that there is a HDD bottleneck because I am retrieving random indices that are obviously not in the memmap cache. Therefore, I would like the memmap cache to use RAM as much as possible.

Is there a way for me to tell memmap to maximize IO and RAM usage?

Upvotes: 2

Views: 591

Answers (1)

ntg
ntg

Reputation: 14095

(Checking my python 2.7 source) As far as I can tell NumPy memmap uses mmap. mmap does define:

# Variables with simple values
...
ALLOCATIONGRANULARITY = 65536
PAGESIZE = 4096

However i am not sure it would be wise (or even possible) to change those. Furthermore, this may not solve your problem and would definitely not give you the most efficient solution, because there is caching and page reading at OS level and at hardware level (because for hardware it takes more or less the same time to read a single value or the whole page).

A much better solution would probably be to sort your requests. (I suppose here that N is large, otherwise just sort them once): Gather a bunch of them (say one or ten millions?) and before doing the request, sort them. Then ask the ordered queries. Then after getting the answers put them back in their original order...

Upvotes: 2

Related Questions