Reputation: 1270
I have some Cython code involving extremely repetitive pixel-wise operations on NumPy arrays (representing BGR images), of the following form:
ctypedef double (*blend_type)(double, double)  # function pointer

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cdef cnp.ndarray[cnp.float_t, ndim=3] blend_it(const double[:, :, :] array_1,
                                               const double[:, :, :] array_2,
                                               const blend_type blendfunc,
                                               const double opacity):
    # the base layer is a (array_1)
    # the blend layer is b (array_2)
    # base layer is below blend layer
    cdef Py_ssize_t y_len = array_1.shape[0]
    cdef Py_ssize_t x_len = array_1.shape[1]
    cdef Py_ssize_t a_channels = array_1.shape[2]
    cdef Py_ssize_t b_channels = array_2.shape[2]
    cdef cnp.ndarray[cnp.float_t, ndim=3] result = np.zeros((y_len, x_len, a_channels), dtype=np.float_)
    cdef double[:, :, :] result_view = result
    cdef Py_ssize_t x, y, c
    cdef double a, b
    for y in range(y_len):
        for x in range(x_len):
            for c in range(3):  # iterate over BGR channels first
                # calculate channel values via blend mode
                a = array_1[y, x, c]
                b = array_2[y, x, c]
                result_view[y, x, c] = blendfunc(a, b)
    # many other operations involving result_view...
    return result
where blendfunc refers to another Cython function, such as the following overlay_pix:
cdef double overlay_pix(double a, double b):
    if a < 0.5:
        return 2*a*b
    else:
        return 1 - 2*(1 - a)*(1 - b)
The goal in using the function pointer is to avoid having to rewrite that huge mess of repetitive code over and over again for each blending mode (of which there are plenty). Hence, I created an interface like this for each blending mode, saving me that trouble:
def overlay(double[:, :, :] array_1, double[:, :, :] array_2, double opacity=1.0):
    return blend_it(array_1, array_2, overlay_pix, opacity)
However, it appears this costs me some time! I notice that, for extremely large images (such as 8K images and larger), there is a substantial time loss when blend_it calls through blendfunc instead of calling overlay_pix directly. I assume this is because blend_it has to dereference the function pointer on every iteration instead of having the function immediately available to it, but I don't know for sure.
The time loss is not ideal, but I certainly don't want to rewrite blend_it over and over again for each blending mode. Is there some way to avoid the time loss? Is there some way to turn the function pointer into a local function outside of the loop and then access it faster inside the loop?
Upvotes: 2
Views: 692
Reputation: 30927
@ead's answer says a couple of things of interest. Cython and C++ templates are a bit of a mess, and so I don't think you want to use them here. However, Cython does have a template-like feature called fused types, which you can use to get a compile-time optimization, as demonstrated below. The rough outline of the code is:
1. A cdef class containing a staticmethod cdef function for each of the operations you want to do.
2. A fused type listing all the cdef classes it should know about. (This is the limitation of this approach - it isn't easily extendable, so if you want to add operations you have to edit the code.)
3. A function that takes a dummy argument of the fused type and uses that type to call the staticmethod.
4. Wrapper functions for each class, using the [type] syntax to get it to work.
Code:
import cython

cdef class Plus:
    @staticmethod
    cdef double func(double x):
        return x+1

cdef class Minus:
    @staticmethod
    cdef double func(double x):
        return x-1

ctypedef fused pick_func:
    Plus
    Minus

cdef run_func(double [::1] x, pick_func dummy):
    cdef int i
    with cython.boundscheck(False), cython.wraparound(False):
        for i in range(x.shape[0]):
            x[i] = cython.typeof(dummy).func(x[i])
    return x.base

def run_func_plus(x):
    return run_func[Plus](x, Plus())

def run_func_minus(x):
    return run_func[Minus](x, Minus())
For comparison, the equivalent code using function pointers is:
cdef double add_one(double x):
    return x+1

cdef double minus_one(double x):
    return x-1

cdef run_func_ptr(double [::1] x, double (*f)(double)):
    cdef int i
    with cython.boundscheck(False), cython.wraparound(False):
        for i in range(x.shape[0]):
            x[i] = f(x[i])
    return x.base

def run_func_ptr_plus(x):
    return run_func_ptr(x, add_one)

def run_func_ptr_minus(x):
    return run_func_ptr(x, minus_one)
Using timeit, I get about a 2.5x speedup for the fused-type version compared to using function pointers. This suggests that the function pointer calls don't get optimized for me (however, I haven't tried changing compiler settings to try to improve that).
import numpy as np
import example
# show the two methods give the same answer
print(example.run_func_plus(np.ones((10,))))
print(example.run_func_minus(np.ones((10,))))
print(example.run_func_ptr_plus(np.ones((10,))))
print(example.run_func_ptr_minus(np.ones((10,))))
from timeit import timeit
# timing comparison
print(timeit("""run_func_plus(x)""",
"""from example import run_func_plus
from numpy import zeros
x = zeros((10000,))
""",number=10000))
print(timeit("""run_func_ptr_plus(x)""",
"""from example import run_func_ptr_plus
from numpy import zeros
x = zeros((10000,))
""",number=10000))
Upvotes: 2
Reputation: 34367
It is true that using a function pointer can have some small additional cost, but most of the time the performance hit is due to the compiler no longer being able to inline the called function and to perform the optimizations that become possible after inlining.
I would like to demonstrate this with the following example, which is somewhat smaller than yours:
int f(int i){
    return i;
}

int sum_with_fun(){
    int sum=0;
    for(int i=0;i<1000;i++){
        sum+=f(i);
    }
    return sum;
}

typedef int(*fun_ptr)(int);

int sum_with_ptr(fun_ptr ptr){
    int sum=0;
    for(int i=0;i<1000;i++){
        sum+=ptr(i);
    }
    return sum;
}
So there are two versions that calculate the sum of f(i) for i=0...999: one through a function pointer and one with a direct call.
When compiled with -fno-inline (i.e. disabling inlining to level the playing field), they produce almost identical assembler (here on godbolt.org); the only slight difference is how the function is called:
callq 4004d0 <_Z1fi> //direct call
...
callq *%r12 //via ptr
Performance-wise, this will not make much difference.
But when we drop -fno-inline, the compiler can shine for the direct version, as it becomes (here on godbolt.org):
_Z12sum_with_funv:
movl $499500, %eax
ret
i.e. the whole loop is evaluated at compile time, in contrast to the unchanged indirect version, which needs to execute the loop at runtime:
_Z12sum_with_ptrPFiiE:
pushq %r12
movq %rdi, %r12
pushq %rbp
xorl %ebp, %ebp
pushq %rbx
xorl %ebx, %ebx
.L5:
movl %ebx, %edi
addl $1, %ebx
call *%r12
addl %eax, %ebp
cmpl $1000, %ebx
jne .L5
movl %ebp, %eax
popq %rbx
popq %rbp
popq %r12
ret
So where does that leave you? You can wrap the indirect function with known pointers, and the chances are high that the compiler will be able to perform the above optimizations. See for example:
...
int sum_with_f(){
    return sum_with_ptr(&f);
}
results in (here on godbolt.org):
_Z10sum_with_fv:
movl $499500, %eax
ret
With the above approach, you are at the mercy of the compiler (but modern compilers are merciful) to do the inlining.
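In Cython terms, this amounts to keeping the generic loop and the thin wrappers in the same .pyx module, so that the generated C passes a constant function pointer the compiler can see through; marking the generic function cdef inline may improve the odds further. Below is a minimal, untested sketch in the style of the run_func_ptr example above (run_func_ptr_inl and run_add_one are illustrative names; whether the indirect call actually gets inlined still depends on the C compiler and its optimization flags):
import cython

ctypedef double (*fun_ptr)(double)

cdef double add_one(double x):
    return x + 1

# 'inline' is only a hint; the C compiler decides whether the constant
# pointer passed at the call site below gets propagated and inlined
cdef inline run_func_ptr_inl(double[::1] x, fun_ptr f):
    cdef Py_ssize_t i
    with cython.boundscheck(False), cython.wraparound(False):
        for i in range(x.shape[0]):
            x[i] = f(x[i])
    return x.base

def run_add_one(x):
    # the function pointer is a compile-time constant here
    return run_func_ptr_inl(x, add_one)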
There are also other options, depending on what you actually use:
Upvotes: 0