Francisco

Reputation: 31

Cython: slow numpy arrays

I am trying to speed up my code using Cython. After translating some code from Python to Cython I see no speed-up at all. I think the origin of the problem is the poor performance I am getting from numpy arrays in Cython.

I have come up with a very simple program to show this:

############### test.pyx #################
import numpy as np
cimport numpy as np
cimport cython

def func1(long N):  # accumulate into three scalar doubles

    cdef double sum1,sum2,sum3
    cdef long i

    sum1 = 0.0
    sum2 = 0.0
    sum3 = 0.0

    for i in range(N):
        sum1 += i
        sum2 += 2.0*i
        sum3 += 3.0*i

    return sum1,sum2,sum3

def func2(long N):  # accumulate into a numpy array (buffer syntax)

    cdef np.ndarray[np.float64_t,ndim=1] sum_arr
    cdef long i

    sum_arr = np.zeros(3,dtype=np.float64)

    for i in range(N):
        sum_arr[0] += i
        sum_arr[1] += 2.0*i
        sum_arr[2] += 3.0*i

    return sum_arr

def func3(long N):  # accumulate into a plain C array

    cdef double sum_arr[3]
    cdef long i

    sum_arr[0] = 0.0
    sum_arr[1] = 0.0
    sum_arr[2] = 0.0

    for i in range(N):
        sum_arr[0] += i
        sum_arr[1] += 2.0*i
        sum_arr[2] += 3.0*i

    return sum_arr
##########################################

################## test.py ###############
import time
import test as test

N = 1000000000

for i in xrange(10):

    start = time.time()
    sum1,sum2,sum3 = test.func1(N)
    print 'Time taken = %.3f'%(time.time()-start)

print '\n'
for i in xrange(10):
    start = time.time()
    sum_arr = test.func2(N)
    print 'Time taken = %.3f'%(time.time()-start)

print '\n'
for i in xrange(10):
    start = time.time()
    sum_arr = test.func3(N)
    print 'Time taken = %.3f'%(time.time()-start)
############################################

Running python test.py I get:

Time taken = 1.445
Time taken = 1.433
Time taken = 1.434
Time taken = 1.428
Time taken = 1.449
Time taken = 1.425
Time taken = 1.421
Time taken = 1.451
Time taken = 1.483
Time taken = 1.418

Time taken = 2.623
Time taken = 2.603
Time taken = 2.977
Time taken = 3.237
Time taken = 2.748
Time taken = 2.798
Time taken = 2.811
Time taken = 2.783
Time taken = 2.585
Time taken = 2.595

Time taken = 1.503
Time taken = 1.529
Time taken = 1.509
Time taken = 1.543
Time taken = 1.427
Time taken = 1.425
Time taken = 1.423
Time taken = 1.415
Time taken = 1.414
Time taken = 1.418

My question is: why is func2 almost 2x slower than func1 and func3?

Is there a way to improve this?

Thanks!

######## UPDATE

My real problem is as follows. I am calling a function that accepts a 3D array (say P[i,j,k]). The function loops through each element and computes several quantities: one that depends on the value of the array at that position (say A=f(P[i,j,k])) and others that depend only on the position itself (B=g(i,j,k)). Schematically things look like this:

for i in xrange(N):
    corr1 = h(i,val)

    for j in xrange(N):
        corr2 = h(j,val)

        for k in xrange(N):
            corr3 = h(k,val)

            A = f(P[i,j,k])
            B = g(i,j,k)
            Arr[B] += A*corr1*corr2*corr3

where val is a property of the 3D array, represented by a number that can be different for different fields.

Since I have to do this operation over many 3D arrays, I thought it would be better to create a new routine that accepts many different input 3D arrays, with the number of arrays unknown a priori. The idea is that since B will be exactly the same for all arrays, I can avoid computing it for each array and compute it only once. The problem is that corr1, corr2 and corr3 above then become arrays:

If I have a number of 3D arrays equal to num_3D_arrays, I am doing something like:

for i in xrange(N):
    for p in xrange(num_3D_arrays):
        corr1[p] = h(i,val[p])

    for j in xrange(N):
        for p in xrange(num_3D_arrays):
            corr2[p] = h(j,val[p])

        for k in xrange(N):
            for p in xrange(num_3D_arrays):
                corr3[p] = h(k,val[p])

            B = g(i,j,k)
            for p in xrange(num_3D_arrays):
                A[p] = f(P[p][i,j,k])  # value from the p-th 3D array
                Arr[p,B] += A[p]*corr1[p]*corr2[p]*corr3[p]

So the fact that corr1, corr2, corr3 and A have changed from scalars to arrays is killing the performance gain I expected from avoiding the repeated big loop.


Upvotes: 1

Views: 2423

Answers (3)

user7138814

Reputation: 2041

There are a couple of things you can do to speed up array indexing in Cython: turn off bounds checking and negative-index wraparound with the @cython.boundscheck(False) and @cython.wraparound(False) decorators, and declare the array as a contiguous typed memoryview (np.float64_t[::1]) instead of the np.ndarray buffer syntax.

So for your function:

@cython.boundscheck(False)
@cython.wraparound(False)
def func2(long N):

    cdef np.float64_t[::1] sum_arr
    cdef long i

    sum_arr = np.zeros(3,dtype=np.float64)

    for i in range(N):
        sum_arr[0] += i
        sum_arr[1] += 2.0*i
        sum_arr[2] += 3.0*i

    return sum_arr
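
Note that returning sum_arr here hands back a memoryview object rather than an ndarray; if the caller needs a numpy array, wrap it with np.asarray(sum_arr).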

For the original code Cython produced the following C code for the line sum_arr[0] += i:

__pyx_t_12 = 0;
__pyx_t_6 = -1;
if (__pyx_t_12 < 0) {
  __pyx_t_12 += __pyx_pybuffernd_sum_arr.diminfo[0].shape;
  if (unlikely(__pyx_t_12 < 0)) __pyx_t_6 = 0;
} else if (unlikely(__pyx_t_12 >= __pyx_pybuffernd_sum_arr.diminfo[0].shape)) __pyx_t_6 = 0;
if (unlikely(__pyx_t_6 != -1)) {
  __Pyx_RaiseBufferIndexError(__pyx_t_6);
  {__pyx_filename = __pyx_f[0]; __pyx_lineno = 13; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
}
*__Pyx_BufPtrStrided1d(__pyx_t_5numpy_float64_t *, __pyx_pybuffernd_sum_arr.rcbuffer->pybuffer.buf, __pyx_t_12, __pyx_pybuffernd_sum_arr.diminfo[0].strides) += __pyx_v_i;

With the improvements above:

__pyx_t_8 = 0;
*((double *) ( /* dim=0 */ ((char *) (((double *) __pyx_v_sum_arr.data) + __pyx_t_8)) )) += __pyx_v_i;

Upvotes: 3

hpaulj

Reputation: 231738

Look at the HTML produced by cython -a ...pyx.

For func1, the sum1 += i line expands to:

+15:         sum1 += i
    __pyx_v_sum1 = (__pyx_v_sum1 + __pyx_v_i);

For func3, with a C array:

+45:         sum_arr[0] += i
    __pyx_t_3 = 0;
    (__pyx_v_sum_arr[__pyx_t_3]) = ((__pyx_v_sum_arr[__pyx_t_3]) + __pyx_v_i);

Slightly more complicated, but still straightforward C.

But for func2:

+29:         sum_arr[0] += i
    __pyx_t_12 = 0;
    __pyx_t_6 = -1;
    if (__pyx_t_12 < 0) {
      __pyx_t_12 += __pyx_pybuffernd_sum_arr.diminfo[0].shape;
      if (unlikely(__pyx_t_12 < 0)) __pyx_t_6 = 0;
    } else if (unlikely(__pyx_t_12 >= __pyx_pybuffernd_sum_arr.diminfo[0].shape)) __pyx_t_6 = 0;
    if (unlikely(__pyx_t_6 != -1)) {
      __Pyx_RaiseBufferIndexError(__pyx_t_6);
      __PYX_ERR(0, 29, __pyx_L1_error)
    }
    *__Pyx_BufPtrStrided1d(__pyx_t_5numpy_float64_t *, __pyx_pybuffernd_sum_arr.rcbuffer->pybuffer.buf, __pyx_t_12, __pyx_pybuffernd_sum_arr.diminfo[0].strides) += __pyx_v_i;

Much more complicated, with references to numpy buffer helpers (e.g. __Pyx_BufPtrStrided1d). Even initializing the array is complicated:

+26:     sum_arr = np.zeros(3,dtype=np.float64)
  __pyx_t_1 = __Pyx_GetModuleGlobalName(__pyx_n_s_np); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 26, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
   ....

I expect that moving the sum_arr creation into the calling Python code and passing it as an argument to func2 would save some time.
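
A minimal sketch of that idea, assuming the memoryview version from the other answer (the name func2_prealloc is hypothetical):

@cython.boundscheck(False)
@cython.wraparound(False)
def func2_prealloc(long N, np.float64_t[::1] sum_arr):
    # sum_arr is allocated once by the caller, e.g.
    #   buf = np.zeros(3, dtype=np.float64)
    #   test.func2_prealloc(N, buf)
    # so the np.zeros cost is paid once instead of on every call
    cdef long i
    for i in range(N):
        sum_arr[0] += i
        sum_arr[1] += 2.0*i
        sum_arr[2] += 3.0*i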

Have you read this guide for using memoryviews:

http://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html

You'll get the best Cython performance if you focus on writing the low-level operations so they translate into simple C. In

    for k in xrange(N):
        corr3 = h(k,val)

        A = f(P[i,j,k])
        B = g(i,j,k)
        Arr[B] += A*corr1*corr2*corr3

It's not the loops over i, j, k that will slow you down; it's evaluating h, f and g each time, as well as the Arr[B] += .... Those functions should be tightly coded Cython, not general Python functions. Look at the compiled simplicity of the sum3d function in the memoryview guide.
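
For instance, a hedged sketch of what tightly coded means here (the bodies of h and f below are placeholders, not your actual functions):

# cdef inline functions with C argument and return types compile to
# plain C calls (or get inlined entirely), with no Python call overhead
cdef inline double h(long i, double val):
    return val*i   # placeholder body

cdef inline double f(double x):
    return x*x     # placeholder body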

Upvotes: 0

B. M.

Reputation: 18668

  • Why is func2 almost 2x slower than func1?

    It's because indexing causes an indirection, so you double the number of elementary operations. Calculate the sums as in func1, then assign with sum_arr = array([sum1, sum2, sum3]).

  • How to speed up Python code?

    1. NumPy is the first good idea; it gets you nearly C speed with no effort.

    2. Numba can fill the gap with no effort too, and is very simple.

    3. Cython for critical cases.

Here is an illustration of that:

# python way
def func1(N):
    sum1 = 0.0
    sum2 = 0.0
    sum3 = 0.0

    for i in range(N):
        sum1 += i
        sum2 += 2.0*i
        sum3 += 3.0*i

    return sum1,sum2,sum3

# numpy way
from numpy import arange

def func2(N):
    aran = arange(float(N))
    sum1=aran.sum()
    sum2=(2.0*aran).sum()
    sum3=(3.0*aran).sum()
    return sum1,sum2,sum3

# numba way
import numba

func3 = numba.njit(func1)

"""
In [609]: %timeit func1(10**6)
1 loop, best of 3: 710 ms per loop

In [610]: %timeit func2(1e6)
100 loops, best of 3: 22.2 ms per loop

In [611]: %timeit func3(10e6)
100 loops, best of 3: 2.87 ms per loop
"""

Upvotes: 0
