Reputation: 4254
In my code I have for loop that indexes over a multidimensional numpy array and does some operation using the sub-array that is obtained at each iteration. It looks like this
for sub in Arr:
#do stuff using sub
Now the stuff that is done using sub
is fully vectorized, so it should be efficient. On the other hand this loop iterates about ~10^5
times and is the bottleneck. Do you think I will get an improvement by offloading this part to C. I am somewhat reluctant to do so because the do stuff using sub
uses broadcasting, slicing, smart-indexing tricks that would be tedious to write in plain C. I would also welcome thoughts and suggestions about how to deal with broadcasting, slicing, smart-indexing when offloading computation to C.
Upvotes: 4
Views: 2209
Reputation: 2628
San you can take a look at scipy.weave
. You can use scipy.weave.blitz
to transparently translate your expression into C++
code and run it. It will handle slicing automatically and get rid of temporaries, but you claim that the body of your for
loop does not create temporaries so your milage may vary.
However if you want to replace your entire for loop with something more efficient then you could make use of scipy.inline
. The drawback is that you have to write C++
code. This should not be too hard because you can use Blitz++
syntax which is very close to numpy array expressions. Slicing is directly supported, broadcasting however is not.
There are two work arounds:
is to use the numpy-C api and use multi-dimensional iterators. They transparently handle broadcasting. However you are invoking the Numpy runtime so there might be some overhead. The other option, and possibly the simpler option is to use the usual matrix notation for broadcasting. Broadcast operations can be written as outer-products with vector of all ones. The good thing is that Blitz++
will not actually create this temporary broadcasted arrays in memory, it will figure out how to wrap it into an equivalent loop.
For the second option take a look at http://www.oonumerics.org/blitz/docs/blitz_3.html#SEC88 for index place holders. As long as your matrix has less than 11 dimensions you are fine. This link shows how they can be used to form outer products http://www.oonumerics.org/blitz/docs/blitz_3.html#SEC99 (search for outer products to go to the relevant part of the document).
Upvotes: 2
Reputation: 17114
Besides using Cython, you can write the bottle-neck part(s) in Fortran. Then use f2py to compile it to Python .pyd file.
Upvotes: 1
Reputation: 68682
If you can't 'vectorize' the entire operation and looping is indeed the bottleneck, then I highly recommend using Cython. I've been dabbling around with it recently and it is straightforward to work with and has a decent interface with numpy. For something like a langevin integrator I saw a 115x speedup over a decent implementation in numpy. See the documentation here:
http://docs.cython.org/src/tutorial/numpy.html
and I also recommend looking at the following paper
You may see satisfactory speedups by just typing the input array and the loop counter, but if you want to leverage the full potential of cython, then you are going to have to hardcode the equivalent broadcasting.
Upvotes: 6