Apocheir

Reputation: 187

Order of indexes in a Numpy multidimensional array

For example, say I'm simulating a bunch of particles doing something over time, and I have a multidimensional array called particles with these indexes:

- a: which coordinate of a particle's position (x, y, or z)
- b: which particle
- c: which timestep

Is it better to construct the array such that particles.shape == (a, b, c) or particles.shape == (c, b, a)?

I'm more interested in convention than efficiency: Numpy arrays can be set up in either C-style (last index varies most rapidly) or Fortran-style (first index), so it can efficiently support either setup. I also realize I can use transpose to put the indexes in any order I need, but I'd like to minimize that.
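
For concreteness, here's a minimal sketch of the two layouts (the shapes are arbitrary, and checking the flags is just one way to see the difference):

    import numpy as np

    # C order: the last index varies most rapidly in memory.
    c_style = np.zeros((4, 5, 6), order='C')
    # Fortran order: the first index varies most rapidly in memory.
    f_style = np.zeros((4, 5, 6), order='F')

    c_style.flags['C_CONTIGUOUS']    # True
    f_style.flags['F_CONTIGUOUS']    # True

    # Transposing returns a view with reversed strides, no copy:
    # a C-ordered array seen through .T is Fortran-ordered.
    c_style.T.flags['F_CONTIGUOUS']  # True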

I started to research this myself and found support for both ways:

Pro-(c,b,a):

Pro-(a,b,c):

In general it appears that there are two different conventions in existence (probably due to historical differences between C and Fortran), and it's not clear which is more common in the Numpy community, or more appropriate for what I'm doing.

Upvotes: 5

Views: 3336

Answers (2)

hpaulj

Reputation: 231375

Another bias is that when a new dimension has to be added, the numpy preference is to add it on the left; that is, x[None, ...] is what broadcasting does automatically, and stacking creates the new axis at the front:

np.array([x,y,z])   # produces a (3,...) array

np.ones((3,2)) + np.ones((1,2,10)) # error
np.ones((3,2,1)) + np.ones((2,10))  # (3,2,10)

But I don't see how this front-first broadcasting favors one position or the other for the x/y/z coordinate.

While np.dot uses a fixed convention (it contracts the last axis of its first argument with the 2nd-to-last axis of its second), np.tensordot and np.einsum are much more general.
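
A small illustration of that difference, with arbitrary shapes:

    import numpy as np

    a = np.ones((2, 3, 4))
    b = np.ones((4, 5))

    # np.dot contracts a's last axis with b's 2nd-to-last (here, its first):
    np.dot(a, b).shape                         # (2, 3, 5)

    # tensordot and einsum contract whichever axes you name,
    # so they don't force the coordinate axis into any position:
    np.tensordot(a, b, axes=([2], [0])).shape  # (2, 3, 5)
    np.einsum('ijk,kl->ijl', a, b).shape       # (2, 3, 5)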


Apocheir points out that doing a reduction on the last axis may require adding a newaxis back, e.g.:

    x / np.linalg.norm(x, axis=0)                    # automatic newaxis at beginning
    x / np.linalg.norm(x, axis=-1)[..., np.newaxis]  # explicit newaxis

For small x, this explicit newaxis adds measurable execution time. But for large x, the 2nd calculation is faster. I think that's because reduction on the last axis is faster: that's the axis that varies most rapidly in memory (for order='C').
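
A rough way to check this yourself (a sketch; the sizes here are arbitrary and the numbers will vary by machine):

    import numpy as np
    from timeit import timeit

    x0 = np.random.rand(3, 1_000_000)   # coordinate axis first
    x1 = np.random.rand(1_000_000, 3)   # coordinate axis last

    # reduce over the coordinate axis in each layout and compare:
    timeit(lambda: x0 / np.linalg.norm(x0, axis=0), number=10)
    timeit(lambda: x1 / np.linalg.norm(x1, axis=-1)[..., np.newaxis], number=10)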

A number of the builtin reduction methods have a keepdims parameter to facilitate broadcasting in uses like this (e.g. sum, mean).
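
For example, the normalization above can then skip the explicit newaxis (a sketch; np.linalg.norm also accepts keepdims):

    import numpy as np

    x = np.random.rand(1000, 3)   # hypothetical (particle, coordinate) layout

    # keepdims=True leaves the reduced axis in place with length 1,
    # so the result broadcasts against x directly:
    unit = x / np.linalg.norm(x, axis=-1, keepdims=True)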

Upvotes: 2

Joe Kington

Reputation: 284612

Conventions for something like this have much more to do with particular file-formats than anything else, in my experience. However, there's a quick way to answer which one is likely to be best for what you're doing:

If you have to iterate over an axis, which one are you most likely to iterate over? In other words, which of these is most likely:

# a first
for dimension in particles:
    ...

# b first
for particle in particles:
    ...

# c first
for timestep in particles:
    ...

As far as efficiency goes, this assumes C-order, but that's actually irrelevant here. At the python level, access to numpy arrays is treated as C-ordered regardless of the memory layout. (You always iterate over the first axis, even if that's not the "most contiguous" axis in memory.)
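
A quick demonstration of that point (array contents chosen arbitrarily):

    import numpy as np

    c_arr = np.arange(6).reshape(2, 3)   # C-ordered
    f_arr = np.asfortranarray(c_arr)     # same values, Fortran-ordered

    # iteration walks the first axis in both cases:
    for row_c, row_f in zip(c_arr, f_arr):
        assert (row_c == row_f).all()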

Of course, there are many situations where you should avoid directly iterating over numpy arrays in this manner. Nonetheless, this is the way you should think about it, particularly when it comes to on-disk file structures. Make your most common use case the fastest/easiest.

If nothing else, hopefully this gives you a useful way to think about the question.

Upvotes: 4
