djhoese

Reputation: 3647

Cython vs numpy performance scaling

I've been playing around with Cython in preparation for other work. I tried a simple test case and noticed something odd in how my code performs at larger problem sizes. I created a simple min/max function that calculates the min and max of a 2D float32 array and compared it to running numpy.min(a) and numpy.max(a). For an array of 10000 elements the performance was similar. For an array of 1000000 elements the Cython version performed much worse. Here's my Cython code:

import numpy
cimport cython
cimport numpy

DTYPE = numpy.float32
ctypedef numpy.float32_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
def minmax_float32(numpy.ndarray[DTYPE_t, ndim=2] arr):
    cdef DTYPE_t min = arr[0, 0]
    cdef DTYPE_t max = arr[0, 0]
    cdef int row_max = arr.shape[0]
    cdef int col_max = arr.shape[1]
    cdef int x, y
    for y in range(row_max):
        for x in range(col_max):
            if arr[y, x] < min:
                min = arr[y, x]
            if arr[y, x] > max:
                max = arr[y, x]

    return min, max

And here's my simple timing done in ipython:

a = numpy.random.random(10000).reshape((100, 100)).astype(numpy.float32)
%timeit -r3 -n50 (numpy.min(a), numpy.max(a))
# 50 loops, best of 3: 22.2 µs per loop

%timeit -r3 -n50 minmax_float32(a)
# 50 loops, best of 3: 23.8 µs per loop

a = numpy.random.random(1000000).reshape((1000, 1000)).astype(numpy.float32)
%timeit -r3 -n50 (numpy.min(a), numpy.max(a))
# 50 loops, best of 3: 307 µs per loop

%timeit -r3 -n50 minmax_float32(a)
# 50 loops, best of 3: 1.22 ms per loop

307 / 22.2
# 13.82882882882883

1220 / 23.8
# 51.26050420168067

Does anyone have ideas for why cython takes so much longer for the larger input? And this was just something I was playing with, but if you have any tips or tricks I'm interested in hearing them. Thanks in advance.

Edit: I ran these tests on a MacBook running OS X 10.10 with 8GB of memory. I compiled the Cython with gcc from MacPorts, using the flags mentioned in their tutorials: -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing.
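For reference, a build invocation matching those flags would look roughly like the following (a sketch only; the module name minmax.pyx and the include paths are illustrative and will differ per system):

```
cython minmax.pyx
gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing \
    -I$(python -c "import numpy; print(numpy.get_include())") \
    -I/usr/include/python2.7 \
    -o minmax.so minmax.c
```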

Upvotes: 10

Views: 655

Answers (2)

Steve Barnes

Reputation: 28360

First, to avoid confusion: it is never a good idea to use the built-in function names min and max as variable names, so call them fmin and fmax instead.

Basically, it is worth remembering that numpy is highly optimised. That said, you might also try changing this in your Cython:

    for x in range(col_max):
        if arr[y, x] < min:
            min = arr[y, x]
        if arr[y, x] > max:
            max = arr[y, x]

to:

    for x in range(col_max):
        val = arr[y, x]
        if val < fmin:
            fmin = val
        if val > fmax:
            fmax = val

and adding the definition: cdef DTYPE_t val

This will reduce the number of array indexing operations per iteration from up to four down to one.
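In plain Python terms (the Cython version just adds the cdef types), the cached-value loop looks like this. It is a sketch to illustrate the logic, not a performance recommendation on its own:

```python
import numpy as np

def minmax(arr):
    # Single pass over the array; cache each element in a local
    # variable instead of indexing arr[y, x] up to four times
    # per iteration.
    fmin = fmax = arr[0, 0]
    rows, cols = arr.shape
    for y in range(rows):
        for x in range(cols):
            val = arr[y, x]
            if val < fmin:
                fmin = val
            if val > fmax:
                fmax = val
    return fmin, fmax

a = np.random.random(100).reshape(10, 10).astype(np.float32)
assert minmax(a) == (np.min(a), np.max(a))
```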

You could also try using:

    (fmin, fmax) = (min(fmin, val), max(fmax, val))

as it may show some improvement.

You can also make x, y, row_max and col_max into unsigned ints, and turn off bounds checking by adding the function decorator @cython.boundscheck(False) # turn off bounds-checking for the entire function (which your code already does).
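Putting those suggestions together, the revised function might look something like this (an untested sketch: renamed locals, the cached val, unsigned index types, and the decorators the question already uses):

```
import numpy
cimport cython
cimport numpy

DTYPE = numpy.float32
ctypedef numpy.float32_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
def minmax_float32(numpy.ndarray[DTYPE_t, ndim=2] arr):
    cdef DTYPE_t fmin = arr[0, 0]
    cdef DTYPE_t fmax = arr[0, 0]
    cdef DTYPE_t val
    cdef unsigned int row_max = arr.shape[0]
    cdef unsigned int col_max = arr.shape[1]
    cdef unsigned int x, y
    for y in range(row_max):
        for x in range(col_max):
            val = arr[y, x]  # index the array once per iteration
            if val < fmin:
                fmin = val
            if val > fmax:
                fmax = val
    return fmin, fmax
```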

This tutorial is worth reading.

Upvotes: 1

nneonneo

Reputation: 179372

It looks like NumPy uses SSE instructions where available for min and max, which means they can likely take advantage of your hardware to a much greater extent than Cython can achieve.

Here's the source code for NumPy's min and max reduction implementations in SSE: https://github.com/numpy/numpy/blob/master/numpy/core/src/umath/simd.inc.src#L696. Note that they are using a preprocessor to automatically generate code for multiple datatypes and operations simultaneously.
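For what it's worth, numpy.min and numpy.max over a whole array are ufunc reductions (numpy.minimum.reduce and numpy.maximum.reduce), and it is those reduction loops that the linked SSE code implements for contiguous data. A quick sanity check of that equivalence:

```python
import numpy as np

a = np.random.random(10000).reshape(100, 100).astype(np.float32)

# np.min/np.max over the full array are just ufunc reductions;
# the SIMD loops in simd.inc.src are an implementation of these.
assert np.min(a) == np.minimum.reduce(a, axis=None)
assert np.max(a) == np.maximum.reduce(a, axis=None)
```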

Upvotes: 4
