Reputation: 73
I am trying to get a FLOPS benchmark for cython and numpy. I wrote a program in cython for this purpose. Here it is:
cimport numpy as np
import numpy as np
import time
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def numpybenchmark():
    cdef np.ndarray[np.float64_t, ndim=2] m1 = np.random.rand(3,3)
    cdef np.ndarray[np.float64_t, ndim=1] m2 = np.random.rand(3)
    cdef np.ndarray[np.float64_t, ndim=1] res
    cdef int niters = 10000000
    cdef int x
    t1 = time.time()
    for x in range(niters):
        res = np.dot(m1, m2)
    t2 = time.time()
    cdef double numopsperloop = 9. + 6.
    cdef double totalops = numopsperloop * float(niters)
    cdef double mflops = totalops / (t2-t1) / 1024. / 1024.
    print('Computed MFLops is: ' + str(mflops))
On my machine I measure "Computed MFLops is: 7.42390102416". My machine has an Intel Core i7-6700HQ CPU @ 2.6 GHz and is running Windows 10.
If you want to run it on your machine, save the code in a file called "benchmark.pyx". Then create a file called "setup.py" with the following contents:
from distutils.core import setup
from Cython.Build import cythonize
import numpy

setup(
    ext_modules=cythonize("benchmark.pyx"),
    include_dirs=[numpy.get_include()]
)
Then you should be able to compile it with "python setup.py build_ext --inplace". On Windows it might be a little more difficult, as I ran into the dreaded "unable to find vcvarsall.bat" error and had to spend significant effort working around that.
This performance seems pretty poor to me. I'm wondering if someone can run it on their platform and tell me what you get, or point out any obvious error I made in my code that is adversely affecting performance?
thanks!
Upvotes: 2
Views: 2018
Reputation: 73
After reading the post from DavidW carefully and doing some experimentation, I found a way to avoid all the numpy overhead. It involves using pointers and specifically not passing numpy arrays around to functions inside the loop.
Here is the full code:
cimport numpy as np
import numpy as np
import time
cimport cython

cdef matrixdotvector(double* mat, int numrows, int numcols, double* vec, double* outputvec):
    outputvec[0] = mat[0+0*numcols] * vec[0] + mat[1+0*numcols] * vec[1] + mat[2+0*numcols] * vec[2]
    outputvec[1] = mat[0+1*numcols] * vec[0] + mat[1+1*numcols] * vec[1] + mat[2+1*numcols] * vec[2]
    outputvec[2] = mat[0+2*numcols] * vec[0] + mat[1+2*numcols] * vec[1] + mat[2+2*numcols] * vec[2]

@cython.boundscheck(False)
@cython.wraparound(False)
def numpybenchmark():
    cdef np.ndarray[np.float64_t, ndim=2] m1 = np.random.rand(3,3)
    cdef np.ndarray[np.float64_t, ndim=1] m2 = np.transpose(np.random.rand(3))
    cdef np.ndarray[np.float64_t, ndim=1] res
    cdef int niters = 10000000
    cdef int x
    t1 = time.time()
    for x in range(niters):
        res = np.dot(m1, m2)
    t2 = time.time()
    cdef double numopsperloop = 9. + 6.
    cdef double totalops = numopsperloop * float(niters)
    cdef double mflops = totalops / (t2-t1) / 1024. / 1024.
    print('Computed MFLops is: ' + str(mflops))

@cython.boundscheck(False)
@cython.wraparound(False)
def numpybenchmark2():
    cdef int numrows = 3
    cdef int numcols = 3
    cdef np.ndarray[np.float64_t, ndim=2] m1 = np.random.rand(3,3)
    cdef np.ndarray[np.float64_t, ndim=1] m2 = np.transpose(np.random.rand(3))
    cdef np.ndarray[np.float64_t, ndim=1] res = np.zeros(3)
    cdef int niters = 10000000
    cdef int x
    t1 = time.time()
    for x in range(niters):
        matrixdotvector(&m1[0,0], numrows, numcols, &m2[0], &res[0])
    t2 = time.time()
    assert (np.linalg.norm(np.dot(m1,m2) - res) < 1.0e-6), "Arrays do not match"
    cdef double numopsperloop = 9. + 6.
    cdef double totalops = numopsperloop * float(niters)
    cdef double mflops = totalops / (t2-t1) / 1024. / 1024.
    print('Computed MFLops is: ' + str(mflops))
The big difference between numpybenchmark() and numpybenchmark2() is that in numpybenchmark2() I avoid all the numpy overhead by passing raw pointers to the numpy data arrays (as opposed to passing typed numpy objects around, which is also very slow). I avoid the overhead of the np.dot computation itself by unrolling it and re-implementing it by hand.
So the benchmark results I'm getting now are:
In [13]: benchmark.numpybenchmark()
Computed MFLops is: 7.3412268815

In [14]: benchmark.numpybenchmark2()
Computed MFLops is: 1521.81908107
So this is quite a big increase. Honestly this is not a "pythonic" way to do it, but it is screaming fast, so it may be useful in some circumstances. One could argue that this should all be written in C, since the code in matrixdotvector() looks like C code. Personally I find it faster to prototype using C-like Cython code rather than jumping straight into C.
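For reference, here is a rough sketch of a middle ground using typed memoryviews instead of raw pointers; this is my own illustration rather than code from the answer above, and the helper name matvec_mv is made up. Passing an existing memoryview to a cdef function is a cheap struct copy, so the Python call overhead is still avoided:

cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cdef void matvec_mv(double[:, ::1] mat, double[::1] vec, double[::1] out) nogil:
    # plain C-style loops over the memoryviews; no Python objects touched here
    cdef int i, j
    for i in range(mat.shape[0]):
        out[i] = 0.0
        for j in range(mat.shape[1]):
            out[i] += mat[i, j] * vec[j]

The one thing to watch is that slicing a numpy array into a memoryview does go through Python, so you would create the views once before the timing loop (e.g. cdef double[:, ::1] m1_mv = m1) and pass those inside the loop.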
Anyway maybe this post will be useful someday for somebody who is learning cython.
Upvotes: 1
Reputation: 30915
Cython doesn't actually eliminate any of the Python calling overhead on np.dot. This involves (note that the list isn't exhaustive, and may be slightly wrong in places, but it gives the gist):

Finding np.dot to call:
- Looking up the module name np.
- Looking in np's namespace for dot. (Note that all of the above could be eliminated by doing dot = np.dot inside your function, then calling dot.)
- Checking dot for __call__. (This may be done through a quicker mechanism if dot is a C/Fortran compiled function.)

Packing/preparing the arguments for np.dot:
- Building the argument tuple to pass to np.dot.
- Incrementing the reference count of each argument passed to np.dot.

np.dot then processes the arguments:
- Checks the dtype of each of the numpy arrays is the same, and based on the dtype picks which BLAS function to call.
- Allocates space for the output argument (creating a new np.ndarray object and the data buffer held by the ndarray).
- ... Calls the BLAS operation that gives you your floating point operations ...
- ... And decrements the refcount of the input arguments it was passed (checking to see if any should be freed, although none will be).

Your calling function then has to:
- Check the return value of np.dot.
- Assign it to res.
- Decrement the refcount of whatever res previously pointed to, remembering that it's at least a 2-step process because the array data is held separately to the ndarray holder.

If you want to make most of this (except possibly the allocation) insignificant compared to the matrix-vector multiplication, then you need to do your measurement on significantly larger arrays. You could get rid of the allocation with the out optional argument of np.dot. If you want to make it all go away then you can use the scipy Cython BLAS interface to call the BLAS functions directly.
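To make that last suggestion concrete, here is a rough sketch (mine, not part of the original answer) of calling BLAS directly through scipy.linalg.cython_blas; the function name numpybenchmark3 is made up, but dgemv and its Fortran-style pointer signature come from SciPy's Cython BLAS interface:

cimport numpy as np
import numpy as np
import time
cimport cython
from scipy.linalg.cython_blas cimport dgemv

@cython.boundscheck(False)
@cython.wraparound(False)
def numpybenchmark3():
    cdef np.ndarray[np.float64_t, ndim=2] m1 = np.random.rand(3,3)
    cdef np.ndarray[np.float64_t, ndim=1] m2 = np.random.rand(3)
    cdef np.ndarray[np.float64_t, ndim=1] res = np.zeros(3)
    cdef int niters = 10000000
    cdef int x
    cdef int n = 3, inc = 1
    cdef double alpha = 1.0, beta = 0.0
    cdef char trans = b'T'  # the C-ordered m1 looks transposed to the Fortran BLAS routine
    t1 = time.time()
    for x in range(niters):
        # res = alpha * op(m1) * m2 + beta * res, done entirely through pointers,
        # so no Python objects are created or reference-counted inside the loop
        dgemv(&trans, &n, &n, &alpha, &m1[0,0], &n, &m2[0], &inc, &beta, &res[0], &inc)
    t2 = time.time()
    assert (np.linalg.norm(np.dot(m1, m2) - res) < 1.0e-6), "Arrays do not match"
    cdef double mflops = 15. * niters / (t2-t1) / 1024. / 1024.
    print('Computed MFLops is: ' + str(mflops))

The simpler partial fix mentioned above is just np.dot(m1, m2, out=res), which keeps the Python call overhead but at least removes the per-iteration allocation of the result array.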
Upvotes: 2