Reputation: 27
What declarations should I be incorporating with a logic function / index operation so that Cython does the heavy lifting?
I have two large rasters in the form of numpy arrays of equal size. The first array contains vegetation index values and the second array contains field IDs. The goal is to average vegetation index values by field. Both arrays have pesky nodata values (-9999) that I would like to ignore.
Currently the function takes over 60 seconds to execute, which normally I wouldn’t mind so much but I'll be processing potentially hundreds of images. Even a 30 second improvement would be significant. So I’ve been exploring Cython as a way to help speed things up. I’ve been using the Cython numpy tutorial as a guide.
test_cy.pyx code:
import numpy as np
cimport numpy as np
cimport cython
@cython.boundscheck(False)  # turn off bounds-checking for entire function
@cython.wraparound(False)   # turn off negative index wrapping for entire function
cpdef test():
    cdef np.ndarray[np.int16_t, ndim=2] ndvi_array = np.load("Z:cython_test/data/ndvi.npy")
    cdef np.ndarray[np.int16_t, ndim=2] field_array = np.load("Z:cython_test/data/field_array.npy")
    cdef np.ndarray[np.int16_t, ndim=1] unique_field = np.unique(field_array)
    unique_field = unique_field[unique_field != -9999]

    cdef int field_id
    cdef np.ndarray[np.int16_t, ndim=1] f_ndvi_values
    cdef double f_avg

    for field_id in unique_field:
        f_ndvi_values = ndvi_array[np.logical_and(field_array == field_id, ndvi_array != -9999)]
        f_avg = np.mean(f_ndvi_values)
setup.py code:
try:
    from setuptools import setup
    from setuptools import Extension
except ImportError:
    from distutils.core import setup
    from distutils.extension import Extension

from Cython.Build import cythonize
import numpy

setup(ext_modules=cythonize('test_cy.pyx'),
      include_dirs=[numpy.get_include()])
After some researching and running:
cython -a test_cy.pyx
it seems the index operation ndvi_array[np.logical_and(field_array == field_id, ndvi_array != -9999)] is the bottleneck and still relies on Python. I suspect I'm missing some vital declarations here; including ndim didn't have any effect.
I’m fairly new to numpy as well so I'm probably missing something obvious.
Upvotes: 0
Views: 184
Reputation: 53089
Your problem looks fairly vectorizable to me, so Cython might not be the best approach. (Cython shines when there are unavoidable fine-grained loops.) As your dtype is int16, there is only a limited range of possible labels, so using np.bincount should be fairly efficient. Try something like the following. (This assumes all your valid values are >= 0; if that is not the case you'd have to shift, or (cheaper) view-cast to uint16 before using bincount, which is safe here since we are not doing any arithmetic on the labels.)
mask = (ndvi_array != -9999) & (field_array != -9999)
nd = ndvi_array[mask]
fi = field_array[mask]
counts = np.bincount(fi, minlength=2**15)
sums = np.bincount(fi, nd, minlength=2**15)
valid = counts != 0
avgs = sums[valid] / counts[valid]
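For reference, here is a self-contained sketch of the same bincount approach on tiny synthetic rasters (the array values are made up for illustration), including the uint16 view-cast mentioned above and omitting minlength since bincount sizes its output automatically:

```python
import numpy as np

# Synthetic stand-ins for the two rasters (values made up for illustration).
NODATA = -9999
ndvi_array = np.array([[120,    80, NODATA],
                       [200, NODATA,   40]], dtype=np.int16)
field_array = np.array([[1, 1,      2],
                        [2, 2, NODATA]], dtype=np.int16)

# Keep only pixels where both rasters hold valid data.
mask = (ndvi_array != NODATA) & (field_array != NODATA)
nd = ndvi_array[mask].astype(np.float64)  # cast so the per-field sums can't overflow int16
fi = field_array[mask].view(np.uint16)    # reinterpret labels as unsigned; safe, no label arithmetic

counts = np.bincount(fi)            # number of valid pixels per field ID
sums = np.bincount(fi, weights=nd)  # NDVI sum per field ID
valid = counts != 0
avgs = sums[valid] / counts[valid]  # mean NDVI for each field ID that occurs

print(np.flatnonzero(valid))  # the field IDs present in the masked data
print(avgs)                   # their mean NDVI values
```

Field 1 covers pixels 120 and 80, field 2 only the valid pixel 200 (the 40 is dropped because its field ID is nodata), so this prints IDs [1 2] with means [100. 200.].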
Upvotes: 1