gilbertohasnofb

Reputation: 2054

numpy.array slicing behaviour

Why does numpy.array behave differently from Python's list and the standard library's array.array when it comes to slicing? Please consider the examples below:

1) using lists: the statement b = a[1:3] creates a new list object, and modifying b does not modify a.

>>> a = [1,2,3,4]
>>> b = a[1:3]
>>> print(b)
[2, 3]
>>> b[0] = -17
>>> print(b)
[-17, 3]
>>> print(a)
[1, 2, 3, 4]

2) using array.array: the statement b = a[1:3] once again creates a new array object, and modifying b does not modify a.

>>> import array
>>> a = array.array('i', [1,2,3,4])
>>> b = a[1:3]
>>> print(b)
array('i', [2, 3])
>>> b[0] = -17
>>> print(b)
array('i', [-17, 3])
>>> print(a)
array('i', [1, 2, 3, 4])

3) using numpy.array: the statement b = a[1:3] seems to reference the values of the original array, and modifying b modifies a as well!

>>> import numpy
>>> a = numpy.array([1,2,3,4])
>>> b = a[1:3]
>>> print(b)
[2 3]
>>> b[0] = -17
>>> print(b)
[-17   3]
>>> print(a)
[  1 -17   3   4]

The question is: why is this behaviour present in numpy?

Upvotes: 0

Views: 210

Answers (1)

Alex Huszagh

Reputation: 14614

Because NumPy is built for high-performance work on large data. For Python to slice a list, it must allocate a new list, increment the reference count of every element in the slice, store those references in the new list, and then return it. NumPy, by contrast, can return a view: it (essentially) just advances the offset to the start of the data and records a new length, without copying anything.
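A minimal way to see this from Python (np.shares_memory and the .base attribute are standard NumPy APIs):

>>> import numpy as np
>>> a = np.array([1, 2, 3, 4])
>>> b = a[1:3]                    # basic slicing returns a view, not a copy
>>> np.shares_memory(a, b)        # both refer to the same underlying buffer
True
>>> b.base is a                   # b does not own its data; a does
True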

NumPy slicing

Think of a NumPy array as something like this (yes, this is highly oversimplified):

struct array
{
    size_t type_size;  /* size of each element in bytes */
    size_t length;     /* number of elements */
    void* start;       /* pointer to the first element */
};

If you don't know C, this basically means that an array can be thought of as an address in memory designating the start of the data, together with the size of each element it stores and the number of elements in the buffer. For a 32-bit integer array we would have a type_size of 4 and, in this example, a length of 5 (for a buffer of 20 bytes).

When slicing, rather than copying the data, NumPy can simply advance the start pointer and reduce the length.

array slice(array* src, size_t start, size_t end)
{
    array arr = *src;
    /* advance the data pointer by `start` elements and shrink the length */
    arr.start = (char*)arr.start + start * arr.type_size;
    arr.length = end - start;
    return arr;
}
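You can roughly observe that offset adjustment from Python via the __array_interface__ attribute, which exposes the address of the data buffer (the actual addresses will differ on your machine):

>>> import numpy as np
>>> a = np.arange(5, dtype=np.int32)
>>> b = a[1:4]                                     # view starting one element into a's buffer
>>> addr_a = a.__array_interface__['data'][0]      # address of a's first element
>>> addr_b = b.__array_interface__['data'][0]      # address of b's first element
>>> addr_b - addr_a                                # exactly one int32 (a.itemsize) further along
4
>>> b.size                                         # only the length changed; no data was copied
3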

This view-based approach is dramatically cheaper than allocating memory for a new list and then storing (and, since Python is reference counted, incrementing the reference count of) every element pointer in it.
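A crude timing comparison illustrates the gap (a sketch only; the exact numbers depend on your machine and interpreter):

>>> import timeit
>>> import numpy as np
>>> lst = list(range(1_000_000))
>>> arr = np.arange(1_000_000)
>>> # slicing the list copies ~500,000 references; slicing the array just builds a small view
>>> timeit.timeit(lambda: lst[1000:501000], number=100)   # doctest: +SKIP
>>> timeit.timeit(lambda: arr[1000:501000], number=100)   # doctest: +SKIP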

Python Slicing

Here's a simplified version of what list slicing does, written against the Python C API:

PyObject* slice(PyObject* list, size_t start, size_t end)
{
    size_t length = end - start;
    PyObject* out = PyList_New(length);
    for (size_t i = start; i < end; ++i) {
        PyObject* item = PyList_GetItem(list, i);  /* borrowed reference */
        Py_INCREF(item);                           /* the new list needs its own reference */
        PyList_SET_ITEM(out, i - start, item);     /* steals the reference we just took */
    }

    return out;
}

Notice how much more involved this is? And even more goes on under the hood.
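You can watch that reference counting happen from Python with sys.getrefcount (the exact counts vary slightly between interpreter versions, but the increase after slicing is the point):

>>> import sys
>>> a = [object(), object(), object(), object()]
>>> item = a[1]
>>> before = sys.getrefcount(item)   # held by the list, the name `item`, and the call itself
>>> b = a[1:3]                       # the new list takes its own reference to each element
>>> sys.getrefcount(item) - before
1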

Rationale

Think performance: for NumPy to copy on every slice the way a list does, the slice would have to occupy a new block of memory (since the data is contiguous), which means copying the data, likely via memcpy(). This is expensive: for an array of 20,000 np.int32 (~80 KB), all of that data would have to be copied into a new array. With the slice-as-view approach above, only ~24 bytes of metadata are copied (assuming 8-byte size_t and pointers).
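If you do want list-like copy semantics for a particular slice, you can ask for them explicitly and pay the copying cost only where you need it:

>>> import numpy as np
>>> a = np.array([1, 2, 3, 4])
>>> b = a[1:3].copy()   # force an independent copy of the slice
>>> b[0] = -17
>>> print(b)
[-17   3]
>>> print(a)            # unchanged, just like the list example
[1 2 3 4]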

Upvotes: 2
