jungleman007

Reputation: 73

Cython performance benchmark

I am trying to get a FLOPS benchmark for cython and numpy. I wrote a program in cython for this purpose. Here it is:

cimport numpy as np
import numpy as np
import time

cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
def numpybenchmark():

    cdef np.ndarray[np.float64_t, ndim=2] m1 = np.random.rand(3,3)
    cdef np.ndarray[np.float64_t, ndim=1] m2 = np.random.rand(3)
    cdef np.ndarray[np.float64_t, ndim=1] res

    cdef int niters = 10000000
    cdef int x

    t1 = time.time()
    for x in range(niters):
        res = np.dot(m1, m2)
    t2 = time.time()
    cdef double numopsperloop = 9. + 6.  # 9 multiplies + 6 adds per 3x3 matrix-vector product
    cdef double totalops = numopsperloop * float(niters)
    cdef double mflops = totalops / (t2-t1) / 1024. / 1024.
    print('Computed MFLops is: ' + str(mflops))

On my machine I measure "Computed MFLops is: 7.42390102416" (at 15 floating-point operations per iteration, that works out to roughly 2 microseconds per np.dot call). My machine has an Intel Core i7-6700HQ CPU @ 2.6 GHz and is running Windows 10.

If you want to run it on your machine, save the code in a file called "benchmark.pyx". Then create a file called "setup.py" with the following contents:

from distutils.core import setup
from Cython.Build import cythonize
import numpy

setup(
    ext_modules = cythonize("benchmark.pyx"),
    include_dirs=[numpy.get_include()]
)

Then you should be able to compile it with "python setup.py build_ext --inplace". On Windows it may be a little harder; I ran into the dreaded "unable to find vcvarsall.bat" error and had to spend significant effort working around it.

This performance seems pretty poor to me. Could someone run it on their platform and tell me what they get, or point out any obvious error I made in my code that is adversely affecting performance?

Thanks!

Upvotes: 2

Views: 2018

Answers (2)

jungleman007

Reputation: 73

After reading DavidW's answer carefully and doing some experimentation, I found a way to avoid all the numpy overhead. It involves using raw pointers and specifically not passing numpy array objects around to functions inside the loop.

Here is the full code:

cimport numpy as np
import numpy as np
import time


# Hand-unrolled 3x3 matrix-vector product on raw pointers, assuming row-major
# storage (element (row, col) lives at mat[col + row*numcols]); numrows is unused here.
cdef matrixdotvector(double* mat, int numrows, int numcols, double* vec, double* outputvec):
    outputvec[0] = mat[0+0*numcols] * vec[0] + mat[1+0*numcols] * vec[1] + mat[2+0*numcols] * vec[2]
    outputvec[1] = mat[0+1*numcols] * vec[0] + mat[1+1*numcols] * vec[1] + mat[2+1*numcols] * vec[2]
    outputvec[2] = mat[0+2*numcols] * vec[0] + mat[1+2*numcols] * vec[1] + mat[2+2*numcols] * vec[2]

cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
def numpybenchmark():

    cdef np.ndarray[np.float64_t, ndim=2] m1 = np.random.rand(3,3)
    cdef np.ndarray[np.float64_t, ndim=1] m2 = np.transpose(np.random.rand(3))
    cdef np.ndarray[np.float64_t, ndim=1] res

    cdef int niters = 10000000
    cdef int x

    t1 = time.time()
    for x in range(niters):
        res = np.dot(m1, m2)
    t2 = time.time()
    cdef double numopsperloop = 9. + 6.
    cdef double totalops = numopsperloop * float(niters)
    cdef double mflops = totalops / (t2-t1) / 1024. / 1024.
    print('Computed MFLops is: ' + str(mflops))

cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
def numpybenchmark2():

    cdef int numrows = 3
    cdef int numcols = 3
    cdef np.ndarray[np.float64_t, ndim=2] m1 = np.random.rand(3,3)
    cdef np.ndarray[np.float64_t, ndim=1] m2 = np.transpose(np.random.rand(3))
    cdef np.ndarray[np.float64_t, ndim=1] res = np.zeros(3)

    cdef int niters = 10000000
    cdef int x

    t1 = time.time()
    for x in range(niters):
        matrixdotvector(&m1[0,0], numrows, numcols, &m2[0], &res[0])
    t2 = time.time()

    assert (np.linalg.norm(np.dot(m1,m2) - res) < 1.0e-6), "Arrays do not match"

    cdef double numopsperloop = 9. + 6.
    cdef double totalops = numopsperloop * float(niters)
    cdef double mflops = totalops / (t2-t1) / 1024. / 1024.
    print('Computed MFLops is: ' + str(mflops))

The big difference between numpybenchmark() and numpybenchmark2() is that numpybenchmark2() avoids all the numpy overhead by passing pointers into the numpy data arrays (as opposed to passing typed numpy array objects, which is also slow). I avoid the overhead of the np.dot call itself by unrolling the multiplication and re-implementing it by hand.

So the benchmark results I'm getting now are:

In [13]: benchmark.numpybenchmark()
Computed MFLops is: 7.3412268815

In [14]: benchmark.numpybenchmark2()
Computed MFLops is: 1521.81908107

So this is quite a big increase. Honestly, this is not a "pythonic" way to do it, but it is screaming fast, so it may be useful in some circumstances. One could argue that this should all be written in C, since the code in matrixdotvector() looks like C code. Personally, I find it faster to prototype with C-like Cython code than to jump straight into C.
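
As an aside (I have not benchmarked this variant), Cython's typed memoryviews offer a middle ground between raw pointers and full numpy objects. Here is just a minimal sketch of the same hard-coded 3x3 product; matrixdotvector_mv is a made-up name:

cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cdef void matrixdotvector_mv(double[:, ::1] mat, double[::1] vec, double[::1] outputvec):
    # Same hand-unrolled 3x3 product, but the memoryview carries the shape and
    # stride information, so no separate numrows/numcols arguments are needed.
    outputvec[0] = mat[0, 0] * vec[0] + mat[0, 1] * vec[1] + mat[0, 2] * vec[2]
    outputvec[1] = mat[1, 0] * vec[0] + mat[1, 1] * vec[1] + mat[1, 2] * vec[2]
    outputvec[2] = mat[2, 0] * vec[0] + mat[2, 1] * vec[1] + mat[2, 2] * vec[2]

To keep the per-call cost low, coerce the numpy arrays to memoryviews once before the loop (e.g. cdef double[:, ::1] m1_view = m1) and pass the views inside the loop.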

Anyway, maybe this post will be useful someday for somebody who is learning Cython.

Upvotes: 1

DavidW

Reputation: 30915

Cython doesn't actually eliminate any of the Python calling overhead of np.dot. A single call involves (the list isn't exhaustive and may be slightly wrong in places, but it gives the gist):

  1. Finding np.dot to call:

    • A dictionary lookup in the global namespace for np
    • A dictionary lookup in np's namespace for dot. (Note that all of the above could be eliminated by doing dot = np.dot inside your function, then calling dot; see the small sketch after this list.)
    • A dictionary lookup on dot for __call__. (This may be done through a quicker mechanism if dot is a C/Fortran compiled function)
  2. Packing/preparing the arguments for np.dot:

    • Create a tuple containing the two arguments passed to np.dot
    • Increment the reference counts of each of those arguments.
  3. np.dot then processes the arguments ...

    • Unpacks the tuple
    • Checks that each of the arguments is a numpy array.
    • Checks that the dtype of each of the numpy arrays is the same, and based on the dtype picks which BLAS function to call.
    • Checks the array dimensions and ensures that they match.
  4. ... allocates space for the output argument ...

    • Allocates a new np.ndarray object
    • Increments the refcount of that
    • Allocates space for the physical array within the ndarray
  5. ... Calls the BLAS operation that gives you your floating point operations ...

  6. ... And decrements the refcounts of the input arguments it was passed (checking to see if any should be freed, although none will be)

  7. Your calling function then has to:

    • Check if an exception was raised by np.dot
    • Receive the output array (possibly some refcount juggling here)
    • Decrement the refcount of the previous contents of res
    • Free the previous contents of res, remembering that this is at least a two-step process because the data array is held separately from the ndarray object.
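
To illustrate the point in step 1: binding np.dot to a local name outside the loop removes the repeated global and attribute lookups (only those; everything else in the list still happens). A rough sketch, not benchmarked here:

import numpy as np
import time

def numpybenchmark_localdot():
    m1 = np.random.rand(3, 3)
    m2 = np.random.rand(3)
    niters = 10000000
    dot = np.dot  # one global lookup + one attribute lookup, done once
    t1 = time.time()
    for x in range(niters):
        res = dot(m1, m2)  # only a cheap local-variable load per iteration
    t2 = time.time()
    print('Computed MFLops is: ' + str(15. * niters / (t2 - t1) / 1024. / 1024.))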

If you want to make most of this (except possibly the allocation) insignificant compared to the matrix-vector multiplication, then you need to do your measurement on significantly larger arrays. You could get rid of the allocation with the optional out argument of np.dot. If you want to make it all go away, you can use SciPy's Cython BLAS interface (scipy.linalg.cython_blas) to call the BLAS functions directly.
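
For illustration, a rough sketch of that last option using scipy.linalg.cython_blas (available in SciPy 0.16 and later, if I recall correctly). The function name blas_matvec is my own; the trans/dimension juggling is needed because BLAS expects Fortran (column-major) order while numpy arrays are row-major:

cimport cython
from scipy.linalg.cython_blas cimport dgemv

@cython.boundscheck(False)
@cython.wraparound(False)
cdef void blas_matvec(double[:, ::1] a, double[::1] x, double[::1] y):
    # Computes y = a @ x by calling the BLAS dgemv routine directly.
    # a is C-contiguous (row-major); Fortran BLAS sees that memory as the
    # transpose, so pass trans='T' and swap the dimensions accordingly.
    cdef char* trans = b'T'
    cdef int m = a.shape[1], n = a.shape[0], lda = a.shape[1], inc = 1
    cdef double alpha = 1.0, beta = 0.0
    dgemv(trans, &m, &n, &alpha, &a[0, 0], &lda,
          &x[0], &inc, &beta, &y[0], &inc)

Called in the same kind of loop as numpybenchmark2(), this keeps everything at the C level, and for larger matrices it also benefits from whatever optimized BLAS your numpy/scipy is linked against.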

Upvotes: 2
