numpy: copying value defaults on integer indexing vs boolean indexing

Question

I have recently started studying McKinney's Python for data analysis. This tripped me up in the book:

Array slices are views on the original array. This means data is not copied and any modifications to the view will be reflected in the source array ... As NumPy has been designed with large data use case in mind, you could imagine performance and memory problems if NumPy insisted on copying data left to right.

Fine. Seems like a sensible design choice. But two pages later it says:

Selecting data from an array by boolean indexing always creates a copy of the data, even if the returned array is unchanged.

Wait, what? Also,

You can even mix and match boolean arrays with slices ... e.g. data[names == 'Bob', 2:]

Now what would that return? A view on a copy of the data? And why is this behavior the way it is? Coming from R, I see boolean indexing and location based indexing equally frequently used techniques. If NumPy has been designed to avoid copying memory, what drives this design choice?

Thanks.

Davidmh · Accepted Answer

Let's assume a 1D array. The data in memory would look something like:

10 | 11 | 12 | 13 | 14 | 15 | 16

Accessing an element by index is trivial. Just take the position of the first element, and jump n steps. So, for arr[2]:

10 | 11 | 12 | 13 | 14 | 15 | 16
           ^

I can get the position in memory with just one multiplication. Fast and easy.

I can do a slice, and say "take only arr2 = arr[2:-1]":

10 | 11 | 12 | 13 | 14 | 15 | 16
           ^----^----^----^

Now, the memory layout is very similar. Getting an element is a multiplication from a new starting point. arr2[1]:

10 | 11 | 12 | 13 | 14 | 15 | 16
  (ignore) -----^----------

You can do a fancier trick, and say arr3 = arr[::2], take all the elements jumping one each.

10 | 11 | 12 | 13 | 14 | 15 | 16
 ^---------^---------^---------^

Again, getting indexes of arr3 is very simple: just do a multiplication, but now the size is bigger. This is what strides are for, they tell you the sizes of the blocks and how to get elements by indexing. Strides are even more powerful in more dimensions. This is, by the way, the way we can turn memory (1D) into a matrix (2D).

Now, we get to boolean arrays. If my mask is: T F T T F F T and I ask you for the third element, you would need to transvers the mask, find which is the third true, and then get its index; thus, very slow. So, when taking a boolean mask we have to make a copy of the data. There are some masks than can be represented with strides, but not in general, so for consistency, always a copy.

As a side note, sometimes, the cost of making a copy is worth performance-wise. If you want to do many operations reading "every fifth element of an array", the data in memory will not be aligned, so the CPU will have to wait for it to be fetched every time. It would then be faster to make a single copy (will be continuous), and work with it.

numpy: copying value defaults on integer indexing vs boolean indexing

Answers (2)

Related Questions