Mike Azatov

Reputation: 442

Efficient numpy slicing of a large 3D array

I have a large 3D numpy array, lookup = np.random.rand(1000, 1000, 1000). It represents 1000 images of resolution (1000, 1000). For every image I'm trying to get a list of values at different locations. I have a location array, m = np.random.rand(1000, 2) * 1000; m = m.astype('int').

I'm getting the values of every slice at those locations (see example code below):

%timeit lookup[:,m[:,1], m[:,0]]

This operation ends up being surprisingly slow. I get about 20 ms on my laptop. I would want it to be 1-2 orders of magnitude faster. Is there a better way to slice this type of numpy array?

Upvotes: 2

Views: 1501

Answers (1)

Jérôme Richard

Reputation: 50277

This computation is not slow because of Numpy, but because of your hardware, and it would be slow on most modern hardware. The main reason is the random RAM accesses, which make the computation latency-bound.

The input array is big and thus cannot be stored in the CPU cache, only in RAM. Modern RAM can have quite a high throughput, but each fetch comes with a pretty big latency (about 80 ns per random fetch on recent x86 processors). While the throughput of new RAM devices tends to improve significantly over time, this is barely the case for latency. Fetching one (8-byte) double-precision floating-point number at a time sequentially would result in a throughput of size_of_double/RAM_latency = 8/80e-9 ≈ 95 MiB/s. This is a small fraction of what modern mainstream RAM devices can do (dozens of GiB/s).
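As a back-of-the-envelope check of that number (my own restatement of the arithmetic above, with the 80 ns latency as an assumed figure):

bytes_per_fetch = 8          # one float64 value
latency = 80e-9              # assumed latency of a single random DRAM access, in seconds
throughput = bytes_per_fetch / latency
print(throughput / 2**20)    # ~95 MiB/s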

To mitigate this, modern processors can fetch several memory blocks at a time and try to predict RAM accesses so they can load data ahead of time, using prefetching units and speculation. This works well for predictable access patterns (especially contiguous loads) but not at all for a random access pattern like the one in your code. Modern processors do still manage to fetch multiple memory blocks in parallel even from sequential code, but not enough for this kind of code to be fast (the effective throughput of your code is about 400 MiB/s). Not to mention that mainstream x86 processors systematically load 64-byte cache lines from RAM while you only need 8 bytes of each.
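The ~400 MiB/s figure can be recovered from the timing in the question: the gather reads 1000 values per image for 1000 images, i.e. 8 MB of useful data, in about 20 ms (a rough sanity check of my own, not from the answer):

useful_bytes = 1000 * 1000 * 8        # 1000 locations x 1000 images x 8 bytes per float64
elapsed = 20e-3                       # the ~20 ms reported in the question
print(useful_bytes / elapsed / 2**20) # ~380 MiB/s of useful data actually consumed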

One solution is to parallelize this operation. However, this is not very efficient, because whole cache lines are still fetched for each 8-byte value (you will barely get more than 10% of the maximal RAM throughput).
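For reference, a parallel gather could look like the following sketch using Numba (an illustration of this approach, not code from the answer; it assumes Numba is installed):

import numba as nb
import numpy as np

@nb.njit(parallel=True)
def gather(lookup, m):
    # Gather lookup[i, m[j, 1], m[j, 0]] for every image i and location j,
    # with the outer loop over images running in parallel.
    n_img, n_pts = lookup.shape[0], m.shape[0]
    out = np.empty((n_img, n_pts), dtype=lookup.dtype)
    for i in nb.prange(n_img):
        for j in range(n_pts):
            out[i, j] = lookup[i, m[j, 1], m[j, 0]]
    return out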

An alternative solution is to transpose your input data so that the fetched memory blocks are more contiguous. Here is an example:

transposedLookup = np.array(lookup.T)
%timeit transposedLookup[m[:,0], m[:,1],:].T
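A quick sanity check that the transposed indexing returns the same values as the original expression (my addition):

expected = lookup[:, m[:, 1], m[:, 0]]
result = transposedLookup[m[:, 0], m[:, 1], :].T
assert np.array_equal(expected, result)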

Note that the first transposition will be rather slow (mainly because it is not yet optimized by Numpy, but also because of the access pattern) and it requires twice the amount of RAM. A dedicated, optimized transposition routine can speed this step up. It is also possible to transpose the input array in-place if it is cubic (as it is here). It would be even better if you could generate the array directly in its transposed form.
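A minimal sketch of such an in-place transposition for the cubic case (my illustration, not code from the answer; the pure-Python loops are only for clarity and would themselves need to be compiled, e.g. with Numba, to be fast):

# Swap lookup[i, :, k] with lookup[k, :, i] so that lookup becomes its own
# transpose (axes 0 and 2 exchanged) without allocating a second array.
n = lookup.shape[0]
for i in range(n):
    for k in range(i + 1, n):
        tmp = lookup[i, :, k].copy()
        lookup[i, :, k] = lookup[k, :, i]
        lookup[k, :, i] = tmp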

Also note that the second transposition is fast because it is done lazily, but even an eager transposition is still several times faster than the original code.
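Concretely, the difference between the two transposed variants timed below is whether the final transpose stays a view or is materialized (my restatement):

lazy = transposedLookup[m[:, 0], m[:, 1], :].T                         # view: only strides are swapped
eager = np.ascontiguousarray(transposedLookup[m[:, 0], m[:, 1], :].T)  # forces a contiguous copy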

Here are the timings on my machine:

Original code: 14.1 ms
Eager Numpy transposition: 2.6 ms
Lazy Numpy transposition: 0.6 ms

EDIT: note that the same thing applies to m.
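For example (my illustration), storing the coordinates column-wise makes each coordinate vector contiguous in memory:

mT = np.ascontiguousarray(m.T)   # mT[0] and mT[1] are now each contiguous
result = transposedLookup[mT[0], mT[1], :].T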

Upvotes: 5
