Reputation: 2227
I wanted to test Cython performance compared to standard Python. So here I have three versions of a function that loops 200 times, adding the same number to the result over and over, and then returns the result. Using the timeit module I call each one 1,000,000 times.
Here's the first example:
[frynio@manjaro ctest]$ cat nocdefexample.pyx
def nocdef(int num):
    cdef int result = 0
    for i in range(num):
        result += num
    return result

def xd(int num):
    return nocdef(num)
Here's the second (look closely, the first function definition matters):
[frynio@manjaro ctest]$ cat cdefexample.pyx
cdef int cdefex(int num):
    cdef int result = 0
    for i in range(num):
        result += num
    return result

def xd1(int num):
    return cdefex(num)
And there's the third one, which is placed in the main file:
[frynio@manjaro ctest]$ cat test.py
from nocdefexample import xd
from cdefexample import xd1
import timeit
def standardpython(num):
    result = 0
    for i in range(num):
        result += num
    return result

def xd2(num):
    return standardpython(num)
print(timeit.timeit('xd(200)', setup='from nocdefexample import xd', number=1000000))
print(timeit.timeit('xd1(200)', setup='from cdefexample import xd1', number=1000000))
print(timeit.timeit('xd2(200)', setup='from __main__ import xd2', number=1000000))
I compiled it with cythonize -a -i nocdefexample.pyx cdefexample.pyx
and got two .so files. Then when I run python test.py, this shows up:
[frynio@manjaro ctest]$ python test.py
0.10323301900007209
0.06339033499989455
11.448068103000423
So the first one is just def <name>(int num). The second one (which seems to be ~1.5x faster than the first) is cdef int <name>(int num). And the last one is plain def <name>(num).
The last one's performance is terrible, but that's what I expected to see. The interesting thing for me is why the first two differ (I checked it many times; the second is always ~1.5x faster than the first).
Is it only because I specified the return type?
And if so, does that mean they're both Cython functions, or is the first some kind of, I dunno, mixed-type function?
Upvotes: 1
Views: 383
Reputation: 34367
First, you must be aware that in the case of the Cython functions you are measuring just the overhead of calling a cdef- vs. a def-function:
>>> %timeit nocdef(1000)
60.5 ns ± 0.73 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit nocdef(10000)
60.1 ns ± 1.2 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
The C compiler recognizes that the loop will result in num*num and evaluates this multiplication directly without running the loop, and the multiplication is equally fast for 10**3 and 10**4.
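That the loop collapses to a multiplication can be checked in plain Python (a minimal sketch; loop_sum is a hypothetical stand-in for the loop body from the examples above):

```python
def loop_sum(num):
    # mirrors the loop body from the examples above:
    # add num to result, num times
    result = 0
    for i in range(num):
        result += num
    return result

# the loop is mathematically equivalent to the closed form num * num,
# which is what an optimizing C compiler reduces it to
for n in (0, 1, 200, 1000):
    assert loop_sum(n) == n * n
print("loop_sum(n) == n*n for all tested n")
```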
This might come as a surprise to a Python programmer, because the Python interpreter doesn't perform this optimization, so the loop has O(n) running time:
>>> %timeit standardpython(1000)
43.7 µs ± 182 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit standardpython(10000)
479 µs ± 4.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
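A rough sketch (timings vary by machine) confirming that the interpreter really executes the loop, so the cost scales with the argument:

```python
import timeit

def standardpython(num):
    result = 0
    for i in range(num):
        result += num
    return result

# with a 10x larger argument the pure-Python version takes roughly
# 10x longer: no constant folding happens in the interpreter
t_small = timeit.timeit('standardpython(1000)', globals=globals(), number=2000)
t_large = timeit.timeit('standardpython(10000)', globals=globals(), number=2000)
print(f"ratio: {t_large / t_small:.1f}")  # roughly 10 on CPython
```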
Now, calling a cdef function is much faster! Just look at the generated C code for the call of the cdef version (the creation of the resulting Python integer is already included):
__pyx_t_1 = __Pyx_PyInt_From_int(__pyx_f_4test_cdefex(__pyx_v_num)); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 19, __pyx_L1_error)
__pyx_f_4test_cdefex is just a call of a C function. Compare that to the call of the def version, which goes through the whole Python machinery (abbreviated here):
...
__pyx_t_2 = __Pyx_GetModuleGlobalName(__pyx_n_s_nocdef); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 9, __pyx_L1_error)
...
__pyx_t_3 = __Pyx_PyInt_From_int(__pyx_v_num); if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 9, __pyx_L1_error)
...
__pyx_t_4 = PyMethod_GET_SELF(__pyx_t_2);
...
__pyx_t_1 = __Pyx_PyObject_CallOneArg(__pyx_t_2, __pyx_t_3); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 9, __pyx_L1_error)
Cython has to:

- convert num to a Python integer, to be able to call a Python function (__Pyx_PyInt_From_int)
- look up the function (__Pyx_GetModuleGlobalName + PyMethod_GET_SELF)
- call it via the generic Python call machinery (__Pyx_PyObject_CallOneArg)

The first call is probably at least 100 times faster, but the overall speed-up is less than 2x only because calling the "inner" function is not the only work that needs to be done: the def functions xd and xd1 have to be called anyway, and the resulting Python integer must be created.
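Why the overall speed-up stays under 2x can be sketched with a pure-Python model (the names inner/outer are hypothetical stand-ins for the cheap inner computation and the def wrapper):

```python
import timeit

def inner(num):
    # stands in for the already-cheap inner computation
    return num * num

def outer(num):
    # stands in for the def wrappers xd/xd1: one extra Python-level call
    return inner(num)

n = 200_000
t_inner = timeit.timeit('inner(200)', globals=globals(), number=n)
t_outer = timeit.timeit('outer(200)', globals=globals(), number=n)
# the wrapper call itself costs about as much as the inner call,
# so even a free inner function could not cut the total time to zero
print(t_inner, t_outer)
```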
Fun-fact:
>>> %timeit nocdef(16)
44.1 ns ± 0.294 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit nocdef(17)
58.5 ns ± 0.638 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
The reason is CPython's integer pool for the values -5...256: 16*16 = 256 still falls into this range, so the result can be constructed faster, while 17*17 = 289 cannot be taken from the pool.
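The small-integer cache is easy to observe (a CPython implementation detail, not a language guarantee):

```python
def sq(n):
    # computed at runtime so the interpreter cannot constant-fold it
    return n * n

# 16*16 = 256 lies in CPython's cached range -5..256: both calls
# return the very same object
print(sq(16) is sq(16))   # True
# 17*17 = 289 is outside the cache: a fresh int object each time
print(sq(17) is sq(17))   # False
```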
Specifying the return type doesn't play that big a role in your example: it only decides where the conversion to a Python integer happens (either in nocdef or in xd1), but it happens eventually.
Upvotes: 1