san
san

Reputation: 4254

Speeding up loops over a Numpy array

In my code I have for loop that indexes over a multidimensional numpy array and does some operation using the sub-array that is obtained at each iteration. It looks like this

for sub in Arr:
  #do stuff using sub

Now the stuff that is done using sub is fully vectorized, so it should be efficient. On the other hand this loop iterates about ~10^5 times and is the bottleneck. Do you think I will get an improvement by offloading this part to C. I am somewhat reluctant to do so because the do stuff using sub uses broadcasting, slicing, smart-indexing tricks that would be tedious to write in plain C. I would also welcome thoughts and suggestions about how to deal with broadcasting, slicing, smart-indexing when offloading computation to C.

Upvotes: 4

Views: 2209

Answers (3)

srean
srean

Reputation: 2628

San you can take a look at scipy.weave. You can use scipy.weave.blitz to transparently translate your expression into C++ code and run it. It will handle slicing automatically and get rid of temporaries, but you claim that the body of your for loop does not create temporaries so your milage may vary.

However if you want to replace your entire for loop with something more efficient then you could make use of scipy.inline. The drawback is that you have to write C++ code. This should not be too hard because you can use Blitz++ syntax which is very close to numpy array expressions. Slicing is directly supported, broadcasting however is not.

There are two work arounds:

  1. is to use the numpy-C api and use multi-dimensional iterators. They transparently handle broadcasting. However you are invoking the Numpy runtime so there might be some overhead. The other option, and possibly the simpler option is to use the usual matrix notation for broadcasting. Broadcast operations can be written as outer-products with vector of all ones. The good thing is that Blitz++ will not actually create this temporary broadcasted arrays in memory, it will figure out how to wrap it into an equivalent loop.

  2. For the second option take a look at http://www.oonumerics.org/blitz/docs/blitz_3.html#SEC88 for index place holders. As long as your matrix has less than 11 dimensions you are fine. This link shows how they can be used to form outer products http://www.oonumerics.org/blitz/docs/blitz_3.html#SEC99 (search for outer products to go to the relevant part of the document).

Upvotes: 2

riza
riza

Reputation: 17114

Besides using Cython, you can write the bottle-neck part(s) in Fortran. Then use f2py to compile it to Python .pyd file.

Upvotes: 1

JoshAdel
JoshAdel

Reputation: 68682

If you can't 'vectorize' the entire operation and looping is indeed the bottleneck, then I highly recommend using Cython. I've been dabbling around with it recently and it is straightforward to work with and has a decent interface with numpy. For something like a langevin integrator I saw a 115x speedup over a decent implementation in numpy. See the documentation here:

http://docs.cython.org/src/tutorial/numpy.html

and I also recommend looking at the following paper

You may see satisfactory speedups by just typing the input array and the loop counter, but if you want to leverage the full potential of cython, then you are going to have to hardcode the equivalent broadcasting.

Upvotes: 6

Related Questions