Reputation: 31
I am trying to speed up my code using cython. After translating a code into cython from python I am seeing that I have not gained any speed up. I think the origin of the problem is the bad performance I am getting by using numpy arrays into cython.
I have came up with a very simple program to show this:
############### test.pyx #################
import numpy as np
cimport numpy as np
cimport cython
def func1(long N):
cdef double sum1,sum2,sum3
cdef long i
sum1 = 0.0
sum2 = 0.0
sum3 = 0.0
for i in range(N):
sum1 += i
sum2 += 2.0*i
sum3 += 3.0*i
return sum1,sum2,sum3
def func2(long N):
cdef np.ndarray[np.float64_t,ndim=1] sum_arr
cdef long i
sum_arr = np.zeros(3,dtype=np.float64)
for i in range(N):
sum_arr[0] += i
sum_arr[1] += 2.0*i
sum_arr[2] += 3.0*i
return sum_arr
def func3(long N):
cdef double sum_arr[3]
cdef long i
sum_arr[0] = 0.0
sum_arr[1] = 0.0
sum_arr[2] = 0.0
for i in range(N):
sum_arr[0] += i
sum_arr[1] += 2.0*i
sum_arr[2] += 3.0*i
return sum_arr
##########################################
################## test.py ###############
import time
import test as test
N = 1000000000
for i in xrange(10):
start = time.time()
sum1,sum2,sum3 = test.func1(N)
print 'Time taken = %.3f'%(time.time()-start)
print '\n'
for i in xrange(10):
start = time.time()
sum_arr = test.func2(N)
print 'Time taken = %.3f'%(time.time()-start)
print '\n'
for i in xrange(10):
start = time.time()
sum_arr = test.func3(N)
print 'Time taken = %.3f'%(time.time()-start)
############################################
And from python test.py I get:
Time taken = 1.445
Time taken = 1.433
Time taken = 1.434
Time taken = 1.428
Time taken = 1.449
Time taken = 1.425
Time taken = 1.421
Time taken = 1.451
Time taken = 1.483
Time taken = 1.418
Time taken = 2.623
Time taken = 2.603
Time taken = 2.977
Time taken = 3.237
Time taken = 2.748
Time taken = 2.798
Time taken = 2.811
Time taken = 2.783
Time taken = 2.585
Time taken = 2.595
Time taken = 1.503
Time taken = 1.529
Time taken = 1.509
Time taken = 1.543
Time taken = 1.427
Time taken = 1.425
Time taken = 1.423
Time taken = 1.415
Time taken = 1.414
Time taken = 1.418
My question is: why func2 is almost 2x slower than func1 and func3?
Is there a way to improve this?
Thanks!
######## UPDATEMy real problem is as follows. I am calling a function that accepts a 3D array (say P[i,j,k]). The function will loop through each element and compute several quantities: a quantity that depends on the value of the array in that position (say A=f(P[i,j,k])) and another quantities that only depend on the position of the array itself (B=g(i,j,k)). Schematically things will look like this:
for i in xrange(N):
corr1 = h(i,val)
for j in xrange(N):
corr2 = h(j,val)
for k in xrange(N):
corr3 = h(k,val)
A = f(P[i,j,k])
B = g(i,j,k)
Arr[B] += A*corr1*corr2*corr3
where val is a property of the 3D array represented by a number. This number can be different for different fields.
Since I have to do this operation over many 3D arrays, I though that it would be better if I create a new routine that accepts many different input 3D arrays, leaving the number of arrays unknown a-priori. The idea is that since B will be exactly the same over all arrays, I can avoid computing it for each array and only compute it once. The problem is that the corr1, corr2, corr3 above will become arrays:
If I have a number of 3D arrays equal to num_3D_arrays I am doing something as:
for i in xrange(N):
for p in xrange(num_3D_arrays):
corr1[p] = h(i,val[p])
for j in xrange(N):
for p in xrange(num_3D_arrays):
corr2[p] = h(j,val[p])
for k in xrange(N):
for p in xrange(num_3D_arrays):
corr3[p] = h(k,val[p])
B = g(i,j,k)
for p in xrange(num_3D_arrays):
A[p] = f(P[i,j,k])
Arr[p,B] += A[p]*corr1[p]*corr2[p]*corr3[p]
So the val that I am changing the variables corr1,corr2,corr3 and A from scalar to arrays is killing the performance that I would expect to avoid doing the big loop.
#Upvotes: 1
Views: 2423
Reputation: 2041
There are a couple things you can do to speed up array indexing in Cython:
So for your function:
@cython.boundscheck(False)
@cython.wraparound(False)
def func2(long N):
cdef np.float64_t[::1] sum_arr
cdef long i
sum_arr = np.zeros(3,dtype=np.float64)
for i in range(N):
sum_arr[0] += i
sum_arr[1] += 2.0*i
sum_arr[2] += 3.0*i
return sum_arr
For the original code Cython produced the following C code for the line sum_arr[0] += i
:
__pyx_t_12 = 0;
__pyx_t_6 = -1;
if (__pyx_t_12 < 0) {
__pyx_t_12 += __pyx_pybuffernd_sum_arr.diminfo[0].shape;
if (unlikely(__pyx_t_12 < 0)) __pyx_t_6 = 0;
} else if (unlikely(__pyx_t_12 >= __pyx_pybuffernd_sum_arr.diminfo[0].shape)) __pyx_t_6 = 0;
if (unlikely(__pyx_t_6 != -1)) {
__Pyx_RaiseBufferIndexError(__pyx_t_6);
{__pyx_filename = __pyx_f[0]; __pyx_lineno = 13; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
}
*__Pyx_BufPtrStrided1d(__pyx_t_5numpy_float64_t *, __pyx_pybuffernd_sum_arr.rcbuffer->pybuffer.buf, __pyx_t_12, __pyx_pybuffernd_sum_arr.diminfo[0].strides) += __pyx_v_i;
With the improvements above:
__pyx_t_8 = 0;
*((double *) ( /* dim=0 */ ((char *) (((double *) __pyx_v_sum_arr.data) + __pyx_t_8)) )) += __pyx_v_i;
Upvotes: 3
Reputation: 231738
Look at the html
produced by cython -a ...pyx
.
For func1
, the sum1 += i
line expands to :
+15: sum1 += i
__pyx_v_sum1 = (__pyx_v_sum1 + __pyx_v_i);
for func3
, with a C array
+45: sum_arr[0] += i
__pyx_t_3 = 0;
(__pyx_v_sum_arr[__pyx_t_3]) = ((__pyx_v_sum_arr[__pyx_t_3]) + __pyx_v_i);
Slightly more complicated, but a straight forward c
.
But for func2
:
+29: sum_arr[0] += i
__pyx_t_12 = 0;
__pyx_t_6 = -1;
if (__pyx_t_12 < 0) {
__pyx_t_12 += __pyx_pybuffernd_sum_arr.diminfo[0].shape;
if (unlikely(__pyx_t_12 < 0)) __pyx_t_6 = 0;
} else if (unlikely(__pyx_t_12 >= __pyx_pybuffernd_sum_arr.diminfo[0].shape)) __pyx_t_6 = 0;
if (unlikely(__pyx_t_6 != -1)) {
__Pyx_RaiseBufferIndexError(__pyx_t_6);
__PYX_ERR(0, 29, __pyx_L1_error)
}
*__Pyx_BufPtrStrided1d(__pyx_t_5numpy_float64_t *, __pyx_pybuffernd_sum_arr.rcbuffer->pybuffer.buf, __pyx_t_12, __pyx_pybuffernd_sum_arr.diminfo[0].strides) += __pyx_v_i;
Much more complicated with references to numpy
functions (e.g. Pyx_BUfPtrStrided1d
). Even initializing the array is complicated:
+26: sum_arr = np.zeros(3,dtype=np.float64)
__pyx_t_1 = __Pyx_GetModuleGlobalName(__pyx_n_s_np); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 26, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_1);
....
I expect that moving the sum_arr
creation to the calling Python, and passing it as an argument to func2
would save some time.
Have you read this guide for using memoryviews
:
http://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html
You'll get the best cython
performance if you focus on writing the low level operations so they translate into simple c
. In
for k in xrange(N):
corr3 = h(k,val)
A = f(P[i,j,k])
B = g(i,j,k)
Arr[B] += A*corr1*corr2*corr3
It's not the loops on i,j,k
that will slow you down. It's evaluating h
, f
, and g
each time, as well as the Arr[B] +=...
. Those functions should be tightly coded cython
, not general Python functions. Look at the compiled simplicity of the sum3d
function in the memoryview
guide.
Upvotes: 0
Reputation: 18668
why func2 is almost 2x slower than func1?
It's because indexing cause an indirection, so you double the number of elementary operations. Calculate the sum like in func1, then affect with
sum=array([sum1,sum2,sum3])
How to speed python code ?
Numpy is the first good idea , It raise nearly C speed with no effort.
Numba can fill the gap with no effort too, and is very simple.
Cython for critical cases.
Here some illustration of that:
# python way
def func1(N):
sum1 = 0.0
sum2 = 0.0
sum3 = 0.0
for i in range(N):
sum1 += i
sum2 += 2.0*i
sum3 += 3.0*i
return sum1,sum2,sum3
# numpy way
def func2(N):
aran=arange(float(N))
sum1=aran.sum()
sum2=(2.0*aran).sum()
sum3=(3.0*aran).sum()
return sum1,sum2,sum3
#numba way
import numba
func3 =numba.njit(func1)
"""
In [609]: %timeit func1(10**6)
1 loop, best of 3: 710 ms per loop
In [610]: %timeit func2(1e6)
100 loops, best of 3: 22.2 ms per loop
In [611]: %timeit func3(10e6)
100 loops, best of 3: 2.87 ms per loop
"""
Upvotes: 0