MWB

Reputation: 12567

Does Cython offer any reasonably easy and efficient way to iterate Numpy arrays as if they were flat?

Let's say I want to implement Numpy's

x[:] += 1

in Cython. I could write

@cython.boundscheck(False)
@cython.wraparound(False)
def add1(np.ndarray[np.float32_t, ndim=1] x):
    cdef unsigned long i
    for i in range(len(x)):
        x[i] += 1

However, this only works with ndim = 1. I could use

add1(x.reshape(-1))

but this only works with contiguous x.
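
(To see why reshape doesn't help here: for a non-contiguous slice it has to copy, so the in-place add would land on the copy. A quick check, assuming a NumPy recent enough to have np.shares_memory:)

import numpy as np

a = np.zeros((1000, 1000))
b = a[100:-100, 100:-100]         # non-contiguous view

flat = b.reshape(-1)              # NumPy must copy here
print(np.shares_memory(b, flat))  # False: add1(flat) would not touch b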

Does Cython offer any reasonably easy and efficient way to iterate Numpy arrays as if they were flat?

(Reimplementing this particular operation in Cython doesn't make sense, as the Numpy code above should be as fast as it gets -- I'm just using it as a simple example.)

UPDATE:

I benchmarked the proposed solutions:

@cython.boundscheck(False)
@cython.wraparound(False)
def add1_flat(np.ndarray x):
    cdef unsigned long i
    for i in range(x.size):
        x.flat[i] += 1

@cython.boundscheck(False)
@cython.wraparound(False)
def add1_nditer(np.ndarray x):
    it = np.nditer([x], op_flags=[['readwrite']])
    for i in it:
        i[...] += 1

The second function requires a Python-level import numpy as np in addition to the cimport.
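
Something like this preamble is assumed for both functions (the module name add1 just matches the add1.add11 calls further down):

# add1.pyx
cimport cython
cimport numpy as np   # C-level numpy declarations for np.ndarray
import numpy as np    # Python-level numpy, needed at runtime for np.nditer

The results are: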

a = np.zeros((1000, 1000))

b = a[100:-100, 100:-100]

%timeit b[:] += 1
1000 loops, best of 3: 1.31 ms per loop

%timeit add1_flat(b)
1 loops, best of 3: 316 ms per loop

%timeit add1_nditer(b)
1 loops, best of 3: 1.11 s per loop

So they are roughly 250 and 850 times slower than Numpy.

UPDATE 2:

The add11 version (from hpaulj's answer below) uses a for loop inside a for loop, so it doesn't iterate the array as if it were flat. However, it is as fast as Numpy in this case:

%timeit add1.add11(b)
1000 loops, best of 3: 1.39 ms per loop

On the other hand, add1_ravel, one of the proposed solutions, fails to modify the contents of b.

Upvotes: 3

Views: 1754

Answers (3)

hpaulj

Reputation: 231395

http://docs.scipy.org/doc/numpy/reference/arrays.nditer.html

This is a nice tutorial on using nditer, and it ends with a Cython version. nditer is meant to be the all-purpose iterator over one or more arrays in numpy C-level code.

There are also good array examples on the Cython memoryview page:

http://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html

http://docs.scipy.org/doc/numpy/reference/c-api.iterator.html

The data buffer of an ndarray is a flat buffer, so regardless of the array's shape and strides you can iterate over the whole buffer in a flat C-pointer fashion, and things like nditer and memoryviews take care of the element-size details. In C-level code it is actually easier to step through all the elements in this flat fashion than to step by rows, since going by rows has to take the row stride into account.
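
A rough sketch of that flat C-pointer idea for the contiguous case (add1_ptr is just an illustrative name; a non-contiguous view like b from the question would be rejected by the mode="c" buffer and would need its strides handled explicitly):

cimport numpy as np
import numpy as np

def add1_ptr(np.ndarray[np.float64_t, ndim=2, mode="c"] x):
    # Walk the flat data buffer with a raw pointer; valid only because
    # mode="c" guarantees the data is one contiguous block (a sliced
    # view is rejected when the buffer is acquired).
    cdef double* p = &x[0, 0]
    cdef Py_ssize_t i, n = x.size
    for i in range(n):
        p[i] += 1.0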

This runs in Python, and I think it will translate nicely into cython (I don't have that setup on my machine at the moment):

import numpy as np

def add1(x):
    it = np.nditer([x], op_flags=[['readwrite']])
    for i in it:
        i[...] += 1
    return it.operands[0]

x = np.arange(10).reshape(2,5)
y = add1(x)
print(x)
print(y)
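
Both prints should show the incremented array, since the readwrite nditer updates x in place and it.operands[0] is that same array - roughly:

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]]
[[ 1  2  3  4  5]
 [ 6  7  8  9 10]]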

https://github.com/hpaulj/numpy-einsum/blob/master/sop.pyx is a sum-of-products script that I wrote a while back to simulate einsum.

The core of its w = sop(x,y) calculation is:

it = np.nditer(ops, flags, op_flags, op_axes=op_axes, order=order)
it.operands[nop][...] = 0
it.reset()
for xarr, yarr, warr in it:
    x = xarr
    y = yarr
    w = warr
    size = x.shape[0]
    for i in range(size):
        w[i] = w[i] + x[i] * y[i]
return it.operands[nop]

===================

Copying ideas from the nditer document linked above, I got a version of add1 that is only half the speed of the native numpy a+1. The naive nditer (above) isn't much faster in Cython than in Python; a lot of the speedup may be due to the external loop.

@cython.boundscheck(False)
def add11(arg):
    cdef np.ndarray[double] x
    cdef int size
    cdef double value
    it = np.nditer([arg],
                   flags=['external_loop', 'buffered'],
                   op_flags=[['readwrite']])
    for xarr in it:
        x = xarr
        size = x.shape[0]
        for i in range(size):
            x[i] = x[i] + 1.0
    return it.operands[0]

I also coded this nditer in Python with a print of size, and found that it iterated over your b in 78 blocks of size 8192 - that's the buffer size, not some characteristic of b and its data layout.
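
That check looks roughly like this in plain Python (buffersize is an nditer argument if you want something other than the 8192 default):

import numpy as np

a = np.zeros((1000, 1000))
b = a[100:-100, 100:-100]

it = np.nditer([b], flags=['external_loop', 'buffered'])
sizes = [chunk.shape[0] for chunk in it]
print(len(sizes), sizes[:3])   # mostly blocks of 8192 elements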

In [15]: a = np.zeros((1000, 1000)) 
In [16]: b = a[100:-100, 100:-100]

In [17]: timeit add1.add11(b)
100 loops, best of 3: 4.48 ms per loop

In [18]: timeit b[:] += 1
100 loops, best of 3: 8.76 ms per loop

In [19]: timeit add1.add1(b)    # for the unbuffered nditer 
1 loop, best of 3: 3.1 s per loop

In [21]: timeit add1.add11(a)
100 loops, best of 3: 5.44 ms per loop

Upvotes: 2

J.J. Hakala

Reputation: 6214

With numpy.ravel

numpy.ravel(a, order='C')

Return a contiguous flattened array.

A 1-D array, containing the elements of the input, is returned. A copy is made only if needed.

cimport cython
cimport numpy as np
import numpy as np    # Python-level numpy, needed for np.may_share_memory

@cython.boundscheck(False)
@cython.wraparound(False)
def add1_ravel(np.ndarray xs):
    cdef unsigned long i
    cdef double[::1] aview

    narray = xs.ravel()
    aview = narray

    for i in range(aview.shape[0]):
        aview[i] += 1

    # return xs if the memory is shared
    if not narray.flags['OWNDATA'] or np.may_share_memory(xs, narray):
        return xs

    # otherwise new array reshaped
    shape = tuple(xs.shape[i] for i in range(xs.ndim))
    return narray.reshape(shape)
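
Note that for a non-contiguous view such as b from the question, ravel() has to copy, so the increment happens in that copy and the caller has to use the returned array, e.g.:

b2 = add1_ravel(b)   # b itself stays unchanged because ravel() copied
b[...] = b2          # write the result back into the original view if needed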

Upvotes: 1

bnaecker

Reputation: 6430

You could try using the flat attribute of the ndarray, which provides an iterator over the flattened array. It always iterates in row-major (C-style) order, with the last index varying fastest. Something like:

for i in range(x.size):
    x.flat[i] += 1
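
For example, flat on a transposed array still walks that array's row-major order rather than memory order:

import numpy as np

x = np.arange(6).reshape(2, 3)
print(list(x.flat))    # [0, 1, 2, 3, 4, 5]
print(list(x.T.flat))  # [0, 3, 1, 4, 2, 5]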

Upvotes: 1
