Reputation: 1
I am trying to speed up some pure-Python code using Cython. Here is the original Python code:
import numpy as np

def image_to_mblocks(image_component):
    # Split the image into 16x16 macroblocks.
    img_shape = np.shape(image_component)
    v_mblocks = img_shape[0] // 16  # blocks per column
    h_mblocks = img_shape[1] // 16  # blocks per row
    x = image_component
    x = [x[i * 16:(i + 1) * 16, j * 16:(j + 1) * 16]
         for i in range(v_mblocks) for j in range(h_mblocks)]
    return x
The argument image_component is a 2-dimensional numpy.ndarray, where the length of each dimension is evenly divisible by 16. In pure Python, this function is fast: on my machine, 100 calls with an image_component of shape (640, 480) take 80 ms. However, I need to call this function on the order of thousands to tens of thousands of times, so I am interested in speeding it up.
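Timings like the one above can be reproduced with a small timeit harness along these lines (a sketch; the random test image is purely illustrative):

import timeit
import numpy as np

# Dummy 8-bit image whose dimensions are evenly divisible by 16.
image = np.random.randint(0, 256, size=(640, 480), dtype=np.uint8)

# Time 100 calls of the function defined above.
elapsed = timeit.timeit(lambda: image_to_mblocks(image), number=100)
print('100 calls took %.1f ms' % (elapsed * 1000))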
Here is my Cython implementation:
import numpy as np
cimport numpy as np
cimport cython

ctypedef unsigned char DTYPE_pixel

cpdef np.ndarray[DTYPE_pixel, ndim=3] image_to_mblocks(unsigned char[:, :] image_component):
    cdef int i
    cdef int j
    cdef int k = 0
    cdef int v_mblocks = image_component.shape[0] // 16
    cdef int h_mblocks = image_component.shape[1] // 16
    # Pre-allocate a single 3-d array that holds every 16x16 block.
    cdef np.ndarray[DTYPE_pixel, ndim=3] x = np.empty((v_mblocks * h_mblocks, 16, 16), dtype=np.uint8)
    for j in range(h_mblocks):
        for i in range(v_mblocks):
            x[k] = image_component[i * 16:(i + 1) * 16, j * 16:(j + 1) * 16]
            k += 1
    return x
The Cython implementation uses a typed memoryview in order to support slicing of image_component. This Cython implementation takes 250 ms on my machine for 100 iterations (same conditions as before: image_component is a (640, 480) array).
Here is my question: in the example I've given, why does Cython fail to outperform the pure Python implementation?
I believe I've followed all the steps in the Cython documentation for working with numpy arrays, but I've failed to achieve the performance boost that I was expecting.
For reference, here is what my setup.py file looks like:
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
import numpy

extensions = [
    Extension('proto_mpeg_computation', ['proto_mpeg_computation.pyx'],
              include_dirs=[numpy.get_include()]),
]

setup(
    name="proto_mpeg_x",
    ext_modules=cythonize(extensions),
)
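(For completeness: with this setup.py, the extension is built with python setup.py build_ext --inplace before importing proto_mpeg_computation.)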
Upvotes: 0
Views: 888
Reputation: 30936
The reason you see significantly worse performance is that the Cython version copies data, while the original version only creates references to existing data.
The line

x[i * 16:(i + 1) * 16, j * 16:(j + 1) * 16]

creates a view on the original x array (i.e. if you change x, then the view will change too). You can confirm this by checking that the numpy owndata flag is False on the elements of the list returned from your Python function. This operation is very cheap, because all it does is store a pointer and some shape/stride information.
In the Cython version you do

x[k] = image_component[i * 16:(i + 1) * 16, j * 16:(j + 1) * 16]

This needs to copy a 16-by-16 block into the memory already allocated for x. It isn't ultra-slow, but there is more work to do than in your original Python version. Again, you can confirm this by checking owndata on the function's return value; you should find that it is True.
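A minimal sketch of both checks (assuming the pure-Python and Cython versions are importable under the placeholder names image_to_mblocks_py and image_to_mblocks_cy):

import numpy as np

image = np.zeros((640, 480), dtype=np.uint8)

# Pure-Python version: each element of the returned list is a view.
blocks_py = image_to_mblocks_py(image)
print(blocks_py[0].flags.owndata)  # False -- a view on `image`

# Cython version: the returned 3-d array owns a fresh copy of the data.
blocks_cy = image_to_mblocks_cy(image)
print(blocks_cy.flags.owndata)  # True -- the data was copied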
In your case you should consider whether you want views of the data or copies of the data.
This isn't the sort of problem where Cython is going to help much, in my view. Cython gives a good speed-up for indexing individual elements; however, when you index slices it behaves the same way as plain Python/numpy (which is actually pretty efficient for this type of use).
I suspect you'd get a small gain from putting your original Python code into Cython and typing image_component as either unsigned char[:, :] or np.ndarray[DTYPE_pixel, ndim=2]. You can also cut out a tiny bit of reference counting by not using the intermediate x and returning the list comprehension directly, as in the sketch below. Beyond that, I don't see how you can gain much.
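A minimal sketch of what I mean (untested; it reuses the DTYPE_pixel typedef from your code and keeps the view-based slicing of the Python version):

import numpy as np
cimport numpy as np

ctypedef unsigned char DTYPE_pixel

cpdef list image_to_mblocks(np.ndarray[DTYPE_pixel, ndim=2] image_component):
    cdef int v_mblocks = image_component.shape[0] // 16
    cdef int h_mblocks = image_component.shape[1] // 16
    # Each slice below is a view, so no pixel data is copied.
    return [image_component[i * 16:(i + 1) * 16, j * 16:(j + 1) * 16]
            for i in range(v_mblocks) for j in range(h_mblocks)]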
Upvotes: 2