Reputation: 8378
I need to create a boolean mask by thresholding a 3D data array: the mask must be set to True at locations where the data are smaller than the lower acceptable limit or larger than the upper acceptable limit (otherwise False). Succinctly:
mask = (data < low) or (data > high)
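Note that the line above is just shorthand: Python's or does not work element-wise on arrays, so in actual NumPy code the mask would be built with | or np.logical_or, for example (with made-up data and limits):
import numpy as np

data = np.random.random((10, 1000, 1000))
low, high = 0.2, 0.8

# Element-wise equivalent of the shorthand above
mask = (data < low) | (data > high)
# or, equivalently:
mask = np.logical_or(data < low, data > high)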
I have two versions of the code that perform this operation: one works directly with the entire 3D array in numpy, while the other loops over slices of the array along the first axis. Contrary to my expectations, the second method seems to be faster than the first one. Why?
In [1]: import numpy as np
In [2]: import sys
In [3]: print(sys.version)
3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:14:59)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
In [4]: print(np.__version__)
1.14.0
In [5]: arr = np.random.random((10, 1000, 1000))
In [6]: def method1(arr, low, high):
   ...:     """ Fully vectorized computations """
   ...:     out = np.empty(arr.shape, dtype=np.bool)
   ...:     np.greater_equal(arr, high, out)
   ...:     np.logical_or(out, arr < low, out)
   ...:     return out
   ...:
In [7]: def method2(arr, low, high):
   ...:     """ Partially vectorized computations """
   ...:     out = np.empty(arr.shape, dtype=np.bool)
   ...:     for k in range(arr.shape[0]):
   ...:         a = arr[k]
   ...:         o = out[k]
   ...:         np.greater_equal(a, high, o)
   ...:         np.logical_or(o, a < low, o)
   ...:     return out
   ...:
First of all, let's make sure that both methods produce identical results:
In [8]: np.all(method1(arr, 0.2, 0.8) == method2(arr, 0.2, 0.8))
Out[8]: True
And now some timing tests:
In [9]: %timeit method1(arr, 0.2, 0.8)
14.4 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [10]: %timeit method2(arr, 0.2, 0.8)
11.5 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
What is going on here?
EDIT 1: A similar behavior is observed in an older environment:
In [3]: print(sys.version)
2.7.13 |Continuum Analytics, Inc.| (default, Dec 20 2016, 23:05:08)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
In [4]: print(np.__version__)
1.11.3
In [9]: %timeit method1(arr, 0.2, 0.8)
100 loops, best of 3: 14.3 ms per loop
In [10]: %timeit method2(arr, 0.2, 0.8)
100 loops, best of 3: 13 ms per loop
Upvotes: 8
Views: 2135
Reputation: 6482
Outperforming both methods
In method one you access the whole array twice. If it doesn't fit in the cache, the data will be read from RAM twice, which lowers performance. Additionally, temporary arrays may be created, as mentioned in the comments.
Method two is more cache friendly, since you access a smaller part of the array twice, and that part is more likely to fit in cache. The downsides are slow looping and more function calls, which are also quite slow.
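As a rough illustration of the sizes involved (a sketch using the same array shape as in the question):
import numpy as np

arr = np.random.random((10, 1000, 1000))

# method1: the whole 80 MB float array is traversed, and the comparison
# `arr < low` materializes a full-size ~10 MB boolean temporary.
print(arr.nbytes)              # 80_000_000 bytes
print((arr < 0.2).nbytes)      # 10_000_000 bytes

# method2: each iteration only touches one 8 MB slice plus a ~1 MB boolean
# temporary, which has a much better chance of staying in the CPU caches.
print(arr[0].nbytes)           # 8_000_000 bytes
print((arr[0] < 0.2).nbytes)   # 1_000_000 bytes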
To get good performance here it is recommended to compile the code, which can be done with Cython or Numba. Since the Cython version is more work (annotations, a separate compiler), I will show how to do this using Numba.
import numpy as np
import numba as nb

@nb.njit(fastmath=True, cache=True)
def method3(arr, low, high):
    out = np.empty(arr.shape, dtype=nb.boolean)
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            for k in range(arr.shape[2]):
                out[i, j, k] = arr[i, j, k] < low or arr[i, j, k] > high
    return out
Using arr = np.random.random((10, 1000, 1000)), this outperforms your method1 by a factor of two and your method2 by 50 percent on my PC (Core i7-4771, Python 3.5, Windows).
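If you want to time it yourself, keep in mind that the first call to an njit-ed function includes compilation, so warm it up first. A possible harness (it assumes method1 from the question and method3 from above are already defined):
import numpy as np
import timeit

arr = np.random.random((10, 1000, 1000))

method3(arr, 0.2, 0.8)   # warm-up call: triggers (or loads the cached) compilation

# sanity check against the fully vectorized version from the question
assert np.array_equal(method1(arr, 0.2, 0.8), method3(arr, 0.2, 0.8))

print(timeit.timeit(lambda: method3(arr, 0.2, 0.8), number=100))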
This is only a simple example. In more complex code, where you can make use of SIMD and parallel processing (which is also very easy to use), the performance gain can be a lot bigger. In non-compiled code, vectorization is usually, but not always (as shown here), the best you can do; however, it leads to poor cache behaviour, which can cause suboptimal performance if the chunks of data you are accessing don't fit at least in the L3 cache. For some other problems there will also be a performance hit if the data don't fit in the much smaller L1 or L2 cache. Another advantage is the automatic inlining of small njit-ed functions into an njit-ed function that calls them.
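For example, a parallel variant is only a small change; here is a sketch (not benchmarked here) that assumes Numba's prange:
import numpy as np
import numba as nb

# Sketch of a parallel variant: parallel=True plus prange lets Numba spread
# the outer loop over threads, while the loop body stays unchanged.
@nb.njit(fastmath=True, parallel=True)
def method3_parallel(arr, low, high):
    out = np.empty(arr.shape, dtype=nb.boolean)
    for i in nb.prange(arr.shape[0]):
        for j in range(arr.shape[1]):
            for k in range(arr.shape[2]):
                out[i, j, k] = arr[i, j, k] < low or arr[i, j, k] > high
    return out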
Upvotes: 3
Reputation: 8488
In my own tests, the difference in performance was even more noticeable than in your question. The differences continued to be clearly observable after increasing the second and third dimensions of the arr data. It also continued to be observable after commenting out one of the two comparison functions (greater_equal or logical_or), which means we can rule out some kind of strange interaction between the two.
By changing the implementation of the two methods to the following, I could significantly reduce the observable difference in performance (but not completely eliminate it):
def method1(arr, low, high):
    out = np.empty(arr.shape, dtype=np.bool)
    high = np.ones_like(arr) * high
    low = np.ones_like(arr) * low
    np.greater_equal(arr, high, out)
    np.logical_or(out, arr < low, out)
    return out

def method2(arr, low, high):
    out = np.empty(arr.shape, dtype=np.bool)
    high = np.ones_like(arr) * high
    low = np.ones_like(arr) * low
    for k in range(arr.shape[0]):
        a = arr[k]
        o = out[k]
        h = high[k]
        l = low[k]
        np.greater_equal(a, h, o)
        np.logical_or(o, a < l, o)
    return out
I suppose that, when supplying high or low as a scalar to those numpy functions, they may internally first create a numpy array of the correct shape filled with that scalar. When we do this manually outside the functions, in both cases only once for the full shape, the performance difference becomes much less noticeable. This implies that, for whatever reason (maybe cache?), creating such a large array filled with the same constant once may be less efficient than creating k smaller arrays with the same constant (as done automatically by the implementation of method2 in the original question).
Note: in addition to reducing the performance gap, it also makes the performance of both methods much worse (affecting the second method more severely than the first). So, while this may give some indication of where the issue may be, it doesn't appear to explain everything.
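As an aside (not part of the measurements above): if one only wants array-shaped bounds as inputs, np.broadcast_to can provide them without paying for a full copy, since it returns a read-only broadcast view. A small sketch:
import numpy as np

arr = np.random.random((10, 1000, 1000))
high = 0.8

# A read-only view that looks like a full-size constant array but allocates
# no new data (all strides are 0).
high_arr = np.broadcast_to(high, arr.shape)
print(high_arr.shape)    # (10, 1000, 1000)
print(high_arr.strides)  # (0, 0, 0)

out = np.empty(arr.shape, dtype=bool)
np.greater_equal(arr, high_arr, out)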
EDIT
Here is a new version of method2, where we now manually pre-create the smaller arrays within the loop every time, as I suspect is happening internally in numpy in the original implementation in the question:
def method2(arr, low, high):
    out = np.empty(arr.shape, dtype=np.bool)
    for k in range(arr.shape[0]):
        a = arr[k]
        o = out[k]
        h = np.full_like(a, high)
        l = np.full_like(a, low)
        np.greater_equal(a, h, o)
        np.logical_or(o, a < l, o)
    return out
This version is indeed much faster again than the one I have above (confirming that creating many smaller arrays inside the loop is more efficient than one big one outside the loop), but still slower than the original implementation in the question.
Under the hypothesis that these numpy functions are indeed converting the scalar bounds into these kinds of arrays first, the difference in performance between this last function and the one in the question could be due to the arrays being created in Python (my implementation) vs. natively (the original implementation).
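One way to get a feel for how much the array creation alone could cost is to time just the allocations, separately from the comparisons; a rough sketch:
import numpy as np
import timeit

arr = np.random.random((10, 1000, 1000))

# one full-size constant array vs. ten slice-sized constant arrays
def one_big():
    return np.full_like(arr, 0.8)

def many_small():
    return [np.full_like(arr[k], 0.8) for k in range(arr.shape[0])]

print(timeit.timeit(one_big, number=100))
print(timeit.timeit(many_small, number=100))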
Upvotes: 0