Bert Zangle

Reputation: 177

Is there any performance reason to use ndim 1 or 2 vectors in numpy?

This seems like a pretty basic question, but I didn't find anything related to it on stack. Apologies if I missed an existing question.

I've seen some mathematical/linear algebraic reasons why one might want to use numpy vectors "proper" (i.e. ndim 1), as opposed to row/column vectors (i.e. ndim 2).

But now I'm wondering: are there any (significant) efficiency reasons why one might pick one over the other? Or is the choice pretty much arbitrary in that respect?

(edit) To clarify: By "ndim 1 vs ndim 2 vectors" I mean representing a vector that contains, say, the numbers 3 and 4 as either a 1-dimensional array of shape (2,), or a 2-dimensional row/column array of shape (1, 2) / (2, 1).
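Concretely, the representations in question would look like this (shapes shown in the comments):

```python
import numpy as np

v1 = np.array([3, 4])        # ndim 1: a vector "proper", shape (2,)
v2 = np.array([[3, 4]])      # ndim 2: a row vector, shape (1, 2)
v3 = np.array([[3], [4]])    # ndim 2: a column vector, shape (2, 1)

print(v1.ndim, v2.ndim, v3.ndim)  # 1 2 2
```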

The numpy documentation seems to lean towards the first case as the default, but like I said, I'm wondering if there's any performance difference.

Upvotes: 3

Views: 336

Answers (1)

Ami Tavory

Reputation: 76386

If you use numpy properly, then no - it is not a consideration.

If you look at the numpy internals documentation, you can see that

Numpy arrays consist of two major components, the raw array data (from now on, referred to as the data buffer), and the information about the raw array data. The data buffer is typically what people think of as arrays in C or Fortran, a contiguous (and fixed) block of memory containing fixed sized data items. Numpy also contains a significant set of data that describes how to interpret the data in the data buffer.

So, irrespective of the dimensions of the array, all the data is stored in a contiguous buffer. Now consider

a = np.array([1, 2, 3, 4])

and

b = np.array([[1, 2], [3, 4]])
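You can verify that both arrays above are backed by a contiguous buffer, and that the only difference is the shape/strides metadata describing how to interpret it (a quick sketch):

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([[1, 2], [3, 4]])

# Both arrays store their items in one contiguous block of memory.
print(a.flags['C_CONTIGUOUS'], b.flags['C_CONTIGUOUS'])  # True True

# The strides metadata maps indices onto that buffer: for b, stepping
# one row advances 2 items, stepping one column advances 1 item.
print(b.strides == (2 * b.itemsize, b.itemsize))  # True
```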

It is true that accessing a[1] requires slightly fewer operations than accessing b[1, 1] (translating the indices 1, 1 into a flat offset takes some arithmetic), but, for high performance, vectorized operations are required anyway.
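The index translation mentioned here can be made explicit with numpy's own np.ravel_multi_index: for a C-ordered 2x2 array, the indices (1, 1) map to flat offset 1 * 2 + 1 = 3 in the buffer.

```python
import numpy as np

b = np.array([[1, 2], [3, 4]])

# Translate the multi-index (1, 1) into a flat offset: row * ncols + col.
flat = np.ravel_multi_index((1, 1), b.shape)
print(flat)                          # 3
print(b.ravel()[flat] == b[1, 1])    # True
```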

If you want to sum all the elements, then in both cases you would do the same thing, a.sum() and b.sum(), and the sum would run over elements in contiguous memory either way. Conversely, if the data is inherently 2d, you can do things like b.sum(axis=1) to sum over rows. Doing that yourself on a 1d array would be error prone, and no more efficient.
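To illustrate with the two arrays from above: the full reductions are identical, and the 2d layout gives you the per-row reduction essentially for free.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([[1, 2], [3, 4]])

print(a.sum())        # 10
print(b.sum())        # 10 -- same reduction over the same contiguous data
print(b.sum(axis=1))  # [3 7] -- per-row sums, which the 1d layout lacks
```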

So, basically, a 2d array, when it is natural for the problem, just gives greater functionality, with zero or negligible overhead.

Upvotes: 2
