EllipticalInitial

Reputation: 1270

Cython Function Pointer Dereference Time (Compared to Calling Function Directly)

I have some Cython code involving extremely repetitive operations pixel-wise on Numpy arrays (representing BGR images) of the following form:

import numpy as np
cimport numpy as cnp
cimport cython

ctypedef double (*blend_type)(double, double) # function pointer
@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cdef cnp.ndarray[cnp.float_t, ndim=3] blend_it(const double[:, :, :] array_1, const double[:, :, :] array_2, const blend_type blendfunc, const double opacity):
  # the base layer is a (array_1)
  # the blend layer is b (array_2)
  # base layer is below blend layer
  cdef Py_ssize_t y_len = array_1.shape[0]
  cdef Py_ssize_t x_len = array_1.shape[1]
  cdef Py_ssize_t a_channels = array_1.shape[2]
  cdef Py_ssize_t b_channels = array_2.shape[2]
  cdef cnp.ndarray[cnp.float_t, ndim=3] result = np.zeros((y_len, x_len, a_channels), dtype = np.float_)
  cdef double[:, :, :] result_view = result
  cdef Py_ssize_t x, y, c
  cdef double a, b  # typed locals, so the inner loop stays pure C

  for y in range(y_len):
    for x in range(x_len):
      for c in range(3): # iterate over BGR channels first
        # calculate channel values via blend mode
        a = array_1[y, x, c]
        b = array_2[y, x, c]
        result_view[y, x, c] = blendfunc(a, b)
        # many other operations involving result_view...
  return result

Where blendfunc refers to another Cython function, such as the following overlay_pix:

cdef double overlay_pix(double a, double b):
  if a < 0.5:
    return 2*a*b
  else:
    return 1 - 2*(1 - a)*(1 - b)

The goal in using the function pointer is to avoid having to rewrite that huge mess of repetitive code over and over again for each blending mode (of which there are plenty). Hence, I created an interface like this for each blending mode, saving me that trouble:

def overlay(double[:, :, :] array_1, double[:, :, :] array_2, double opacity = 1.0):
  return blend_it(array_1, array_2, overlay_pix, opacity)

However, it appears this costs me some time! I notice that, for extremely large images (such as 8K images and larger), there is a substantial time loss when using blendfunc in the blend_it function instead of calling overlay_pix directly in blend_it. I assume this is because blend_it has to dereference the function pointer on every iteration instead of having the function immediately available to it, but I don't know for sure.

The time loss is not ideal, but I certainly don't want to rewrite blend_it over and over again for each blending mode. Is there some way to avoid the time loss? Is there some way to turn the function pointer into a local function outside of the loop and then access it faster inside the loop?

Upvotes: 2

Views: 692

Answers (2)

DavidW

Reputation: 30927

@ead's answer says two things of interest:

  1. C might be able to optimize it to a direct call. I don't think this is generally true except for fairly simple cases, and it doesn't seem to be true with the compilers and code the OP is using.
  2. In C++ you'd use templates instead - this is definitely true, and since the template types are always known at compile time, optimization is usually easy.

Combining Cython and C++ templates is a bit of a mess, so I don't think you want to use them here. However, Cython does have a template-like feature called fused types. You can use fused types to get a compile-time optimization, as demonstrated below. The rough outline of the code is:

  1. Define a cdef class containing a staticmethod cdef function for each of the operations you want to do.
  2. Define a fused type containing all the cdef classes above. (This is the limitation of this approach: it isn't easily extensible, so adding an operation means editing this list.)
  3. Define a function that takes a dummy argument of your fused type. Use this type to call the staticmethod.
  4. Define wrapper functions - you need to use the explicit [type] syntax to get it to work.

Code:

import cython

cdef class Plus:
    @staticmethod
    cdef double func(double x):
        return x+1    

cdef class Minus:
    @staticmethod
    cdef double func(double x):
        return x-1

ctypedef fused pick_func:
    Plus
    Minus

cdef run_func(double [::1] x, pick_func dummy):
    cdef int i
    with cython.boundscheck(False), cython.wraparound(False):
        for i in range(x.shape[0]):
            x[i] = cython.typeof(dummy).func(x[i])
    return x.base

def run_func_plus(x):
    return run_func[Plus](x,Plus())

def run_func_minus(x):
    return run_func[Minus](x,Minus())

For comparison the equivalent code using function pointers is

cdef double add_one(double x):
    return x+1

cdef double minus_one(double x):
    return x-1

cdef run_func_ptr(double [::1] x, double (*f)(double)):
    cdef int i
    with cython.boundscheck(False), cython.wraparound(False):
        for i in range(x.shape[0]):
            x[i] = f(x[i])
    return x.base

def run_func_ptr_plus(x):
    return run_func_ptr(x,add_one)

def run_func_ptr_minus(x):
    return run_func_ptr(x,minus_one)

Using timeit I get about a 2.5x speedup compared to using function pointers. This suggests that the function pointers don't get optimized away for me (although I haven't tried changing compiler settings to try to improve that).

import numpy as np
import example

# show the two methods give the same answer
print(example.run_func_plus(np.ones((10,))))
print(example.run_func_minus(np.ones((10,))))

print(example.run_func_ptr_plus(np.ones((10,))))
print(example.run_func_ptr_minus(np.ones((10,))))

from timeit import timeit

# timing comparison
print(timeit("""run_func_plus(x)""",
             """from example import run_func_plus
from numpy import zeros
x = zeros((10000,))
""",number=10000))

print(timeit("""run_func_ptr_plus(x)""",
             """from example import run_func_ptr_plus
from numpy import zeros
x = zeros((10000,))
""",number=10000))

Upvotes: 2

ead

Reputation: 34367

It is true that using a function pointer can have some small additional cost, but most of the time the performance hit is due to the compiler no longer being able to inline the called function and to perform the optimizations which were possible after inlining.

I would like to demonstrate this on the following example, which is somewhat smaller than yours:

int f(int i){
    return i;
}

int sum_with_fun(){
    int sum=0;
    for(int i=0;i<1000;i++){
        sum+=f(i);
    }
    return sum;
}

typedef int(*fun_ptr)(int);
int sum_with_ptr(fun_ptr ptr){
    int sum=0;
    for(int i=0;i<1000;i++){
        sum+=ptr(i);
    }
    return sum;
}

So there are two versions for calculating the sum of f(i) for i=0...999: with a function pointer and directly.

When compiled with -fno-inline (i.e. with inlining disabled to level the playing field), they produce almost identical assembler (here on godbolt.org) - the slight difference is how the function is called:

callq  4004d0 <_Z1fi>  //direct call
...
callq  *%r12           //via ptr

Performance-wise, this will not make much difference.

But when we drop the -fno-inline, the compiler can shine for the direct version, as it becomes (here on godbolt.org)

_Z12sum_with_funv:
        movl    $499500, %eax
        ret

i.e. the whole loop is evaluated during compilation, unlike the unchanged indirect version, which needs to execute the loop at runtime:

_Z12sum_with_ptrPFiiE:
        pushq   %r12
        movq    %rdi, %r12
        pushq   %rbp
        xorl    %ebp, %ebp
        pushq   %rbx
        xorl    %ebx, %ebx
.L5:
        movl    %ebx, %edi
        addl    $1, %ebx
        call    *%r12
        addl    %eax, %ebp
        cmpl    $1000, %ebx
        jne     .L5
        movl    %ebp, %eax
        popq    %rbx
        popq    %rbp
        popq    %r12
        ret

So where does this leave you? You could wrap the indirect function with known pointers, and the chances are high that the compiler will be able to perform the above optimizations; see for example:

... 
int sum_with_f(){
    return sum_with_ptr(&f);
}

results in (here on godbolt.org):

_Z10sum_with_fv:
        movl    $499500, %eax
        ret

With the above approach, you are at the mercy of the compiler (though modern compilers are merciful) to do the inlining.

There are also other options, depending on what you actually use:

  1. In C++ one has templates to eliminate this kind of repetitive work without a performance penalty.
  2. In C one would use macros to the same effect.
  3. Numpy uses a preprocessor to generate repetitive code; see for example this src file, from which a C file is generated in a preprocessing step.
  4. pandas uses an approach similar to numpy's for its Cython code; see for example the hashtable_func_helper.pxi.in file.

Upvotes: 0
