Reputation: 1270
I have some Cython code involving extremely repetitive pixel-wise operations on NumPy arrays (representing BGR images), of the following form:
ctypedef double (*blend_type)(double, double)  # function pointer

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cdef cnp.ndarray[cnp.float_t, ndim=3] blend_it(const double[:, :, :] array_1,
                                               const double[:, :, :] array_2,
                                               const blend_type blendfunc,
                                               const double opacity):
    # the base layer is a (array_1)
    # the blend layer is b (array_2)
    # base layer is below blend layer
    cdef Py_ssize_t y_len = array_1.shape[0]
    cdef Py_ssize_t x_len = array_1.shape[1]
    cdef Py_ssize_t a_channels = array_1.shape[2]
    cdef Py_ssize_t b_channels = array_2.shape[2]
    cdef cnp.ndarray[cnp.float_t, ndim=3] result = np.zeros((y_len, x_len, a_channels), dtype=np.float_)
    cdef double[:, :, :] result_view = result
    cdef Py_ssize_t x, y, c
    cdef double a, b
    for y in range(y_len):
        for x in range(x_len):
            for c in range(3):  # iterate over BGR channels first
                # calculate channel values via blend mode
                a = array_1[y, x, c]
                b = array_2[y, x, c]
                result_view[y, x, c] = blendfunc(a, b)
    # many other operations involving result_view...
    return result
where blendfunc refers to another Cython function, such as the following overlay_pix:
cdef double overlay_pix(double a, double b):
    if a < 0.5:
        return 2*a*b
    else:
        return 1 - 2*(1 - a)*(1 - b)
The goal in using the function pointer is to avoid having to rewrite that huge mess of repetitive code over and over again for each blending mode (of which there are plenty). Hence, I created an interface like this for each blending mode, saving me that trouble:
def overlay(double[:, :, :] array_1, double[:, :, :] array_2, double opacity=1.0):
    return blend_it(array_1, array_2, overlay_pix, opacity)
However, it appears this costs me some time! I notice that, for extremely large images (such as 8K images and larger), there is a substantial time loss when blend_it calls through blendfunc instead of calling overlay_pix directly. I assume this is because blend_it has to dereference the function pointer on every iteration instead of having the function immediately available to it, but I don't know for sure.
The time loss is not ideal, but I certainly don't want to rewrite blend_it over and over again for each blending mode. Is there some way to avoid the time loss? Is there some way to turn the function pointer into a local function outside of the loop and then access it faster inside the loop?
Upvotes: 2
Views: 692
Reputation: 30927
@ead's answer says a couple of things of interest. Cython and C++ templates are a bit of a mess, and so I don't think you want to use them here. However, Cython does have a template-like feature called fused types, which you can use to get a compile-time optimization, as demonstrated below. The rough outline of the code is:
1. A cdef class containing a staticmethod cdef function for each of the operations you want to do.
2. A fused type listing all the cdef classes it should know about. (This is the limitation of this approach - it isn't easily extendable, so if you want to add operations you have to edit the code.)
3. A function that takes a dummy argument of the fused type and uses that type to call the staticmethod.
4. Wrapper functions for each class, using the [type] syntax to get it to work.
Code:
import cython

cdef class Plus:
    @staticmethod
    cdef double func(double x):
        return x+1

cdef class Minus:
    @staticmethod
    cdef double func(double x):
        return x-1

ctypedef fused pick_func:
    Plus
    Minus

cdef run_func(double [::1] x, pick_func dummy):
    cdef int i
    with cython.boundscheck(False), cython.wraparound(False):
        for i in range(x.shape[0]):
            x[i] = cython.typeof(dummy).func(x[i])
    return x.base

def run_func_plus(x):
    return run_func[Plus](x, Plus())

def run_func_minus(x):
    return run_func[Minus](x, Minus())
For comparison, the equivalent code using function pointers is:
cdef double add_one(double x):
    return x+1

cdef double minus_one(double x):
    return x-1

cdef run_func_ptr(double [::1] x, double (*f)(double)):
    cdef int i
    with cython.boundscheck(False), cython.wraparound(False):
        for i in range(x.shape[0]):
            x[i] = f(x[i])
    return x.base

def run_func_ptr_plus(x):
    return run_func_ptr(x, add_one)

def run_func_ptr_minus(x):
    return run_func_ptr(x, minus_one)
Using timeit, I get about a 2.5x speedup for the fused-type version compared to using function pointers. This suggests that the function pointer calls don't get optimized for me (however, I haven't tried changing compiler settings to try to improve that).
import numpy as np
import example
# show the two methods give the same answer
print(example.run_func_plus(np.ones((10,))))
print(example.run_func_minus(np.ones((10,))))
print(example.run_func_ptr_plus(np.ones((10,))))
print(example.run_func_ptr_minus(np.ones((10,))))
from timeit import timeit
# timing comparison
print(timeit("""run_func_plus(x)""",
"""from example import run_func_plus
from numpy import zeros
x = zeros((10000,))
""",number=10000))
print(timeit("""run_func_ptr_plus(x)""",
"""from example import run_func_ptr_plus
from numpy import zeros
x = zeros((10000,))
""",number=10000))
Upvotes: 2
Reputation: 34367
It is true that using a function pointer can have some small additional cost, but most of the time the performance hit is due to the compiler no longer being able to inline the called function and to perform the optimizations that become possible after inlining.
I would like to demonstrate this with the following example, which is somewhat smaller than yours:
int f(int i){
    return i;
}

int sum_with_fun(){
    int sum=0;
    for(int i=0;i<1000;i++){
        sum+=f(i);
    }
    return sum;
}

typedef int(*fun_ptr)(int);

int sum_with_ptr(fun_ptr ptr){
    int sum=0;
    for(int i=0;i<1000;i++){
        sum+=ptr(i);
    }
    return sum;
}
So there are two versions that calculate the sum of f(i) for i=0...999: one through a function pointer and one with a direct call.
When compiled with -fno-inline (i.e. disabling inlining to level the playing field), they produce almost identical assembler (here on godbolt.org); the only slight difference is how the function is called:
callq 4004d0 <_Z1fi> //direct call
...
callq *%r12 //via ptr
Performance-wise, this will not make much difference.
But when we drop -fno-inline, the compiler can shine for the direct version, as it becomes (here on godbolt.org):
_Z12sum_with_funv:
movl $499500, %eax
ret
i.e. the whole loop is evaluated at compile time, in contrast to the unchanged indirect version, which needs to execute the loop at runtime:
_Z12sum_with_ptrPFiiE:
pushq %r12
movq %rdi, %r12
pushq %rbp
xorl %ebp, %ebp
pushq %rbx
xorl %ebx, %ebx
.L5:
movl %ebx, %edi
addl $1, %ebx
call *%r12
addl %eax, %ebp
cmpl $1000, %ebx
jne .L5
movl %ebp, %eax
popq %rbx
popq %rbp
popq %r12
ret
So where does that leave you? You can wrap the indirect function with known pointers, and the chances are high that the compiler will be able to perform the above optimizations. See for example:
...
int sum_with_f(){
    return sum_with_ptr(&f);
}
results in (here on godbolt.org):
_Z10sum_with_fv:
movl $499500, %eax
ret
With the above approach, you are at the mercy of the compiler (but modern compilers are merciful) to do the inlining.
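In Cython terms, this amounts to keeping the generic loop and the thin wrappers in the same .pyx module, so that the generated C passes a constant function pointer the compiler can see through; marking the generic function cdef inline may improve the odds further. Below is a minimal, untested sketch in the style of the run_func_ptr example above (run_func_ptr_inl and run_add_one are illustrative names; whether the indirect call actually gets inlined still depends on the C compiler and its optimization flags):
import cython

ctypedef double (*fun_ptr)(double)

cdef double add_one(double x):
    return x + 1

# 'inline' is only a hint; the C compiler decides whether the constant
# pointer passed at the call site below gets propagated and inlined
cdef inline run_func_ptr_inl(double[::1] x, fun_ptr f):
    cdef Py_ssize_t i
    with cython.boundscheck(False), cython.wraparound(False):
        for i in range(x.shape[0]):
            x[i] = f(x[i])
    return x.base

def run_add_one(x):
    # the function pointer is a compile-time constant here
    return run_func_ptr_inl(x, add_one)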
There are also other options, depending on what you actually use:
Upvotes: 0