Reputation: 1823
I'm looking for a concise and efficient way to take multidimensional slices of arrays, perform scalar and matrix arithmetic on those slices, and then ultimately save the resulting array as a slice in another array.
You can do this well in Fortran with syntax like this:
real*8, dimension(4,4,4,4) :: matrix_a
real*8, dimension(4,4) :: matrix_b
...
matrix_a(:, 2, :, 4) = matrix_a(:, 2, :, 4) + (2 * matrix_b(:, :))
I am trying to find ways to do this in Cython. This is the best I can come up with:
cdef double matrix_a[4][4][4][4]
cdef double matrix_b[4][4]
...
cdef int i0, i1
for i0 in range(4):
    for i1 in range(4):
        matrix_a[i0][1][i1][3] += (2 * matrix_b[i0][i1])
Apparently you can do this in a very concise way with NumPy arrays, but that seems to be a couple of orders of magnitude slower than the code above. Is there some way I can have the best of both worlds here? Possibly some way to invoke NumPy that makes it faster, or maybe some kind of C/C++ API that could be leveraged from Cython? Thanks
Upvotes: 3
Views: 563
Reputation: 10690
Great question. The short answer is, no.
The long answer... it depends on what you want and how much effort you're willing to put into it. There's no single answer because the question is very broad, not because the operations aren't possible.
As of this writing, there is no default way to perform Fortran-speed arithmetic on Cython's memoryviews. There have been multiple discussions about this on the Cython mailing lists. This recent thread sums up the situation well.
Now, there are a lot of ways to get fast arithmetic with NumPy arrays. No single one has all the features of the others, but some may suit your needs.
First: np.add(a, b, out=c) does the addition in place, without allocating a temporary array. If you have more specific needs, np.einsum, SciPy's BLAS/LAPACK wrappers, and NumPy's array reduction methods offer greater flexibility. In most cases, if a and b are NumPy arrays, something like

a[:,2,:,4] += 2 * b

should be sufficient.
Even if you're in Cython and operating on memoryviews rather than NumPy arrays, you can do something like:
import numpy as np
np.add(a, b, out=c)
There is a fixed cost for calling a Python function like this, but if the arrays are large, it won't matter.
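As a minimal sketch (the function name here is my own, and I wrap the memoryviews with np.asarray, which makes zero-copy NumPy views of their buffers so the ufunc's out argument gets a real array):

import numpy as np

def scaled_add(double[:, :] a, double[:, :] b, double[:, :] out):
    # np.asarray shares the underlying buffers, so the ufunc writes
    # its result directly into the memory that out views.
    np.add(np.asarray(a), 2 * np.asarray(b), out=np.asarray(out))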
There are two reasons NumPy would still be slower in your case. First, there is a high fixed cost to starting the looping operations, since NumPy processes a lot of its type and shape metadata at runtime. This difference vanishes for larger arrays, so, unless you're inside a tight loop, it won't matter. Second, the default memory layouts for NumPy (row-major) and Fortran (column-major) are different. This difference will probably only show up for operations on larger arrays. If you initialize your NumPy arrays in Fortran memory order and the arrays are large, NumPy should be on par with Fortran.
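Getting Fortran layout is a one-keyword change at allocation time, for example:

import numpy as np

# order='F' requests the same column-major layout Fortran uses.
matrix_a = np.zeros((4, 4, 4, 4), dtype=np.float64, order='F')
matrix_b = np.zeros((4, 4), dtype=np.float64, order='F')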
Now, for those who really are in a situation where NumPy either isn't fast enough or doesn't give fine enough control over the array, there are quite a few other options. Here are some of the more promising ones:
Option 1: Write the arithmetic operation with loops inside a cdef function in Cython. Have it accept memoryviews as arguments and pass in a pre-allocated output array. The signature could look something like:

cdef inline void add(double[:,:] a, double[:,:] b, double[:,:] out) nogil:
Memoryviews can be sliced without the overhead of a Python call, so passing slices of larger arrays to the function you've defined is cheap.
This method isn't the fastest, but, honestly, it's usually good enough. It also avoids most of the pain of manually managing the array's memory layout, and it's easier than doing the same thing with raw C arrays, since memoryviews have their memory managed dynamically.
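A minimal sketch of such a function (the scale parameter is my addition, to cover the 2 * matrix_b factor from the question):

cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cdef inline void add_scaled(double[:, :] a, double[:, :] b,
                            double scale, double[:, :] out) nogil:
    # out = a + scale * b, element by element.
    cdef Py_ssize_t i, j
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            out[i, j] = a[i, j] + scale * b[i, j]

With a 4D memoryview mv over matrix_a, the update from the question becomes add_scaled(mv[:, 1, :, 3], matrix_b_view, 2.0, mv[:, 1, :, 3]) (0-based indices).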
Option 2: Several NumPy-aware auto-vectorizers exist. Consider trying numba, parakeet, or pythran. These packages center primarily on compiling type-specific versions of functions that operate on NumPy arrays and then calling those compiled routines where applicable. I've had mixed results with numba and parakeet and haven't played much with pythran. To my knowledge, numba and parakeet don't currently do any sort of static expression analysis, though it isn't necessarily outside their scope. Pythran looks promising, though; its main selling point is performing static analysis of array expressions (much as Fortran compilers do). These libraries are easy to use, so it could be worth trying them just to see if they help. They are especially good at optimizing loops.
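For instance, a rough sketch with numba (the function is my own example, not anything these libraries ship; njit compiles it to machine code on first call, and strided slices are handled natively):

import numba
import numpy as np

@numba.njit
def add_scaled(a, b, scale, out):
    # Plain loops are fine here: numba compiles them to native code.
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            out[i, j] = a[i, j] + scale * b[i, j]

a = np.random.rand(8, 8, 8, 8)
b = np.random.rand(8, 8)
add_scaled(a[:, 1, :, 3], b, 2.0, a[:, 1, :, 3])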
Option 3: There are also some Python-level expression optimizers that you could use; see numexpr and Theano. These packages, roughly speaking, focus on compiling arithmetic expressions to machine code rather than whole functions. To my knowledge, they don't optimize for slices, but you can pass them slices of a NumPy array. These are best at optimizing operations on larger arrays, but they can often provide code for an expression that is faster than direct arithmetic with NumPy arrays.
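A rough sketch with numexpr (evaluate compiles the expression string and runs it in one pass; this assumes it handles the strided view for out, otherwise evaluate into a fresh array and assign it back):

import numexpr as ne
import numpy as np

a = np.random.rand(8, 8, 8, 8)
b = np.random.rand(8, 8)

s = a[:, 1, :, 3]                # a strided 2D view, not a copy
ne.evaluate("s + 2 * b", out=s)  # evaluated in one pass, written into s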
Option 4: You can implement the operation in Fortran and wrap it using Cython. There's a description of how to do this at the fortran90 website. I wrote an example of how to do this some time ago in this answer. You'll have to manage memory layout yourself, but that isn't too hard, and it will be very fast.
Option 5: You can also use any one of a variety of C++ libraries for array operations. Unfortunately, many of them (eigen, armadillo, blaze-lib) don't currently have built-in support for high-dimensional arrays, but the particular expression you have is actually just arithmetic with a strided 2D array, so they could work fine. If you're okay with that constraint, there have been some initial efforts toward making Cython bindings for armadillo (cy-arma), though they are still underdeveloped. One C++ library that does support n-dimensional array operations is libdynd. It also has a better-developed set of Python bindings written in Cython. It would probably be easiest to implement the operations in C++ and wrap them for Python via Cython, but you could also take the C++ pxd declarations out of those wrappers and write your functions at the Cython level.
Option 6: There is a library called ceygen that uses eigen as a backend to express arithmetic operations on memoryviews as function calls. I've had trouble getting it to compile, but, conceptually, this might be what you're looking for. I'm unsure whether it handles striding information well.
Option 7: You can implement the operations yourself in C. The main time this would pay off is when you want to apply SSE intrinsics to the array (or some other optimization beyond plain looping). Writing things like this is very time-consuming, but you could do it. If you're just looping, you'd be better off writing the loops in Cython and operating on memoryviews.
Option 8: You could use the NumPy C API for ufuncs or gufuncs. This lets you take a scalar (or array) operation and generate a function that applies it to more general arrays. The docs are not very complete right now.
Option 9: This doesn't help with basic arithmetic operations in Cython, but one of the new features that will be added to SciPy some time soon is a set of Cython-level wrappers for BLAS and LAPACK. That will let you do a wide variety of linear algebra operations with very little cost per function call. For now, here's the relevant pull request.
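A rough sketch of what calling one of these wrappers could look like (the module path and signature are assumptions based on what the pull request proposes; daxpy is the BLAS routine computing y := alpha*x + y):

from scipy.linalg.cython_blas cimport daxpy

cdef void scaled_add_1d(double[::1] x, double[::1] y) nogil:
    # y := 2.0 * x + y via BLAS daxpy on contiguous 1D views.
    cdef int n = <int>x.shape[0]
    cdef int inc = 1
    cdef double alpha = 2.0
    daxpy(&n, &alpha, &x[0], &inc, &y[0], &inc)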
Some of these will require that you manage memory layout yourself, but that's not terribly hard.
If I were in your shoes, I'd try option 1 first, and then option 4. Option 1 will be reasonably fast, but if that's not enough, option 4 will probably be faster, though you'll then have to worry about the memory layout of the array yourself.
Upvotes: 5