Reputation: 27
What declarations should I be incorporating with a logic function / index operation so that Cython does the heavy lifting?
I have two large rasters in the form of numpy arrays of equal size. The first array contains vegetation index values and the second array contains field IDs. The goal is to average vegetation index values by field. Both arrays have pesky nodata values (-9999) that I would like to ignore.
Currently the function takes over 60 seconds to execute, which normally I wouldn’t mind so much but I'll be processing potentially hundreds of images. Even a 30 second improvement would be significant. So I’ve been exploring Cython as a way to help speed things up. I’ve been using the Cython numpy tutorial as a guide.
test_cy.pyx code:
import numpy as np
cimport numpy as np
cimport cython
@cython.boundscheck(False)  # turn off bounds-checking for entire function
@cython.wraparound(False)   # turn off negative index wrapping for entire function
cpdef test():
    cdef np.ndarray[np.int16_t, ndim=2] ndvi_array = np.load("Z:cython_test/data/ndvi.npy")
    cdef np.ndarray[np.int16_t, ndim=2] field_array = np.load("Z:cython_test/data/field_array.npy")
    cdef np.ndarray[np.int16_t, ndim=1] unique_field = np.unique(field_array)
    unique_field = unique_field[unique_field != -9999]

    cdef int field_id
    cdef np.ndarray[np.int16_t, ndim=1] f_ndvi_values
    cdef double f_avg

    for field_id in unique_field:
        f_ndvi_values = ndvi_array[np.logical_and(field_array == field_id, ndvi_array != -9999)]
        f_avg = np.mean(f_ndvi_values)
setup.py code:
try:
    from setuptools import setup
    from setuptools import Extension
except ImportError:
    from distutils.core import setup
    from distutils.extension import Extension

from Cython.Build import cythonize
import numpy

setup(ext_modules=cythonize('test_cy.pyx'),
      include_dirs=[numpy.get_include()])
After some researching and running:
cython -a test_cy.pyx
it seems the index operation ndvi_array[np.logical_and(field_array == field_id, ndvi_array != -9999)] is the bottleneck and still relies on Python. I suspect I'm missing some vital declarations here; including ndim didn't have any effect.
I’m fairly new to numpy as well so I'm probably missing something obvious.
Upvotes: 0
Views: 184
Reputation: 53089
Your problem looks fairly vectorizable to me, so Cython might not be the best approach. (Cython shines when there are unavoidable fine-grained loops.) As your dtype is int16, there is only a limited range of possible labels, so using np.bincount should be fairly efficient. Try something like the following. (This assumes all your valid values are >= 0; if that is not the case you'd have to shift, or (cheaper) view-cast to uint16 before using bincount, which is safe here since we are not doing any arithmetic on the labels.)
mask = (ndvi_array != -9999) & (field_array != -9999)
nd = ndvi_array[mask]
fi = field_array[mask]
counts = np.bincount(fi, minlength=2**15)
sums = np.bincount(fi, nd, minlength=2**15)
valid = counts != 0
avgs = sums[valid] / counts[valid]
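For reference, here is a self-contained sketch of the same bincount approach on tiny synthetic rasters (the array values are made up for illustration), including the uint16 view-cast mentioned above and omitting minlength since bincount sizes its output automatically:

```python
import numpy as np

# Synthetic stand-ins for the two rasters (values made up for illustration).
NODATA = -9999
ndvi_array = np.array([[120,    80, NODATA],
                       [200, NODATA,   40]], dtype=np.int16)
field_array = np.array([[1, 1,      2],
                        [2, 2, NODATA]], dtype=np.int16)

# Keep only pixels where both rasters hold valid data.
mask = (ndvi_array != NODATA) & (field_array != NODATA)
nd = ndvi_array[mask].astype(np.float64)  # cast so the per-field sums can't overflow int16
fi = field_array[mask].view(np.uint16)    # reinterpret labels as unsigned; safe, no label arithmetic

counts = np.bincount(fi)            # number of valid pixels per field ID
sums = np.bincount(fi, weights=nd)  # NDVI sum per field ID
valid = counts != 0
avgs = sums[valid] / counts[valid]  # mean NDVI for each field ID that occurs

print(np.flatnonzero(valid))  # the field IDs present in the masked data
print(avgs)                   # their mean NDVI values
```

Field 1 covers pixels 120 and 80, field 2 only the valid pixel 200 (the 40 is dropped because its field ID is nodata), so this prints IDs [1 2] with means [100. 200.].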
Upvotes: 1