Reputation: 2227
I wanted to test Cython performance compared to standard Python. So here I have three versions of a function that loops 200 times, adding the same number to the result over and over, and then returns the result. Using the timeit module I call each one 1,000,000 times.
Here's the first example:
[frynio@manjaro ctest]$ cat nocdefexample.pyx
def nocdef(int num):
    cdef int result = 0
    for i in range(num):
        result += num
    return result

def xd(int num):
    return nocdef(num)
Here's the second (look closely, the first function definition matters):
[frynio@manjaro ctest]$ cat cdefexample.pyx
cdef int cdefex(int num):
    cdef int result = 0
    for i in range(num):
        result += num
    return result

def xd1(int num):
    return cdefex(num)
And there's the third one, which is placed in the main file:
[frynio@manjaro ctest]$ cat test.py
from nocdefexample import xd
from cdefexample import xd1
import timeit
def standardpython(num):
    result = 0
    for i in range(num):
        result += num
    return result

def xd2(num):
    return standardpython(num)
print(timeit.timeit('xd(200)', setup='from nocdefexample import xd', number=1000000))
print(timeit.timeit('xd1(200)', setup='from cdefexample import xd1', number=1000000))
print(timeit.timeit('xd2(200)', setup='from __main__ import xd2', number=1000000))
I compiled it with cythonize -a -i nocdefexample.pyx cdefexample.pyx
and got two .so files. Then when I run python test.py, this shows up:
[frynio@manjaro ctest]$ python test.py
0.10323301900007209
0.06339033499989455
11.448068103000423
So the first one is just def <name>(int num). The second one (which seems to be ~1.5x faster than the first) is cdef int <name>(int num). And the last one is plain def <name>(num).
The last one's performance is terrible, but that's what I expected to see. The interesting thing for me is why the first two differ (I checked it many times; the second is always ~1.5x faster than the first).
Is it only because I specified the return type?
And if so, does that mean they're both Cython functions, or is the first some kind of, I dunno, mixed-type function?
Upvotes: 1
Views: 383
Reputation: 34367
First, you must be aware that in the case of the Cython functions you are measuring just the overhead of calling a cdef- vs. a def-function:
>>> %timeit nocdef(1000)
60.5 ns ± 0.73 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit nocdef(10000)
60.1 ns ± 1.2 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
The C compiler recognizes that the loop will result in num*num and evaluates this multiplication directly without running the loop, and the multiplication is equally fast for 10**3 and 10**4.
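That the loop collapses to a multiplication can be checked in plain Python (a minimal sketch; loop_sum is a hypothetical stand-in for the loop body from the examples above):

```python
def loop_sum(num):
    # mirrors the loop body from the examples above:
    # add num to result, num times
    result = 0
    for i in range(num):
        result += num
    return result

# the loop is mathematically equivalent to the closed form num * num,
# which is what an optimizing C compiler reduces it to
for n in (0, 1, 200, 1000):
    assert loop_sum(n) == n * n
print("loop_sum(n) == n*n for all tested n")
```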
This might come as a surprise to a Python programmer, because the Python interpreter doesn't perform this optimization, so the loop has O(n) running time:
>>> %timeit standardpython(1000)
43.7 µs ± 182 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit standardpython(10000)
479 µs ± 4.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
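A rough sketch (timings vary by machine) confirming that the interpreter really executes the loop, so the cost scales with the argument:

```python
import timeit

def standardpython(num):
    result = 0
    for i in range(num):
        result += num
    return result

# with a 10x larger argument the pure-Python version takes roughly
# 10x longer: no constant folding happens in the interpreter
t_small = timeit.timeit('standardpython(1000)', globals=globals(), number=2000)
t_large = timeit.timeit('standardpython(10000)', globals=globals(), number=2000)
print(f"ratio: {t_large / t_small:.1f}")  # roughly 10 on CPython
```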
Now, calling a cdef function is much faster! Just look at the generated C code for the call of the cdef version (the creation of the resulting Python integer is already included):
__pyx_t_1 = __Pyx_PyInt_From_int(__pyx_f_4test_cdefex(__pyx_v_num)); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 19, __pyx_L1_error)
__pyx_f_4test_cdefex is just a call of a C function. Compare that to the call of the def version, which goes through the whole Python machinery (abbreviated here):
...
__pyx_t_2 = __Pyx_GetModuleGlobalName(__pyx_n_s_nocdef); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 9, __pyx_L1_error)
...
__pyx_t_3 = __Pyx_PyInt_From_int(__pyx_v_num); if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 9, __pyx_L1_error)
...
__pyx_t_4 = PyMethod_GET_SELF(__pyx_t_2);
...
__pyx_t_1 = __Pyx_PyObject_CallOneArg(__pyx_t_2, __pyx_t_3); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 9, __pyx_L1_error)
Cython has to:

- convert num to a Python integer, to be able to call a Python function (__Pyx_PyInt_From_int)
- look up the function (__Pyx_GetModuleGlobalName + PyMethod_GET_SELF)
- call it via the generic Python call machinery (__Pyx_PyObject_CallOneArg)

The first call is probably at least 100 times faster, but the overall speed-up is less than 2x only because calling the "inner" function is not the only work that needs to be done: the def functions xd and xd1 have to be called anyway, and the resulting Python integer must be created.
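Why the overall speed-up stays under 2x can be sketched with a pure-Python model (the names inner/outer are hypothetical stand-ins for the cheap inner computation and the def wrapper):

```python
import timeit

def inner(num):
    # stands in for the already-cheap inner computation
    return num * num

def outer(num):
    # stands in for the def wrappers xd/xd1: one extra Python-level call
    return inner(num)

n = 200_000
t_inner = timeit.timeit('inner(200)', globals=globals(), number=n)
t_outer = timeit.timeit('outer(200)', globals=globals(), number=n)
# the wrapper call itself costs about as much as the inner call,
# so even a free inner function could not cut the total time to zero
print(t_inner, t_outer)
```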
Fun-fact:
>>> %timeit nocdef(16)
44.1 ns ± 0.294 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit nocdef(17)
58.5 ns ± 0.638 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
The reason is CPython's integer pool for the values -5...256: 16*16 = 256 still falls into this range, so the result can be constructed faster, while 17*17 = 289 cannot be taken from the pool.
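The small-integer cache is easy to observe (a CPython implementation detail, not a language guarantee):

```python
def sq(n):
    # computed at runtime so the interpreter cannot constant-fold it
    return n * n

# 16*16 = 256 lies in CPython's cached range -5..256: both calls
# return the very same object
print(sq(16) is sq(16))   # True
# 17*17 = 289 is outside the cache: a fresh int object each time
print(sq(17) is sq(17))   # False
```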
Specifying the return type doesn't play that big a role in your example: it only decides where the conversion to a Python integer happens (either in nocdef or in xd1), but it happens eventually.
Upvotes: 1