Reputation: 909
I have raster containing spatial ecological habitat data which I've converted to a 2-dimensional numpy array. In this array, values of 1 = data, and 0 = no data. From this data I want to generate an array of containing all data cell pairs where the distance between each cell is less than a maximum Euclidean cutoff distance (i.e. 2 cells apart).
I've found this answer useful, but the answers there appear to first measure all pairwise distances, then subsequently threshold the results by a maximum cutoff. My datasets are large (over 1 million data cells in a 13500*12000 array), so any pairwise distance measure that tries to calculate distances between all pairs of cells will fail: I need a solution which somehow stops looking for possible neighbours outside of a certain search radius (or something similar).
I've experimented with scipy.spatial.distance.pdist
, but so far haven't had luck applying it to either my two-dimensional data, or finding a way to prevent pdist
from calculating distances between even distant pairs of cells. I've attached an example array and a desired output array for maximum Euclidean cutoff distance = 2 cells:
import numpy as np
import matplotlib.pyplot as plt
# Example 2-D habitat array (1 = data)
example_array = np.array([[0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
[1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
[1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0],
[1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
# Plot example array
plt.imshow(example_array, cmap="spectral", interpolation='nearest')
Upvotes: 2
Views: 656
Reputation: 8137
I have to confess my numpy is weak -- maybe there is a way to do it directly. Nonetheless, the problem is not difficult in pure Python. The following code will output pairs of x/y coordinates of your matching data. There are a lot of potential optimizations that could obscure the code and make it go faster, but given the size of your data set and the size (2.0) of your example radius, I doubt any of those are worthwhile (with the possible exception of creating numpy views into the arrays instead of sublists).
Updated -- The code has had a couple of bugs fixed -- (1) it was looking too far to the left on lines that were below the starting point, and (2) it was not doing the right thing near the left edge. The invocation of the function now uses a radius of 2.5 to show how additional pairs can be picked up.
example_array = [[0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
[1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
[1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0],
[1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
def findpairs(mylist, radius = 2.0):
"""
Find pairs with data within a given radius.
If we work from the top of the array down, we never
need to look up (because we already would have found
those, and we never need to look left on the same line.
"""
# Create the parameters of a half circle, which is
# the relative beginning and ending X coordinates to
# search for each Y line starting at this one and
# working down. To avoid duplicates and extra work,
# not only do we not look up, we never look left on
# the same line as what we are matching, but we do
# on subsequent lines.
semicircle = []
x = 1
while x:
y = len(semicircle)
x = int(max(0, (radius ** 2 - y ** 2)) ** 0.5)
# Don't look back on same line...
semicircle.append((-x if y else 1, x + 1))
# The maximum number of y lines we will search
# at a time.
max_y = len(semicircle)
for y_start in range(len(mylist)):
sublists = enumerate(mylist[y_start:y_start + max_y], y_start)
sublists = zip(semicircle, sublists)
check = (x for (x, value) in enumerate(mylist[y_start]) if value)
for x_start in check:
for (x_lo, x_hi), (y, ylist) in sublists:
# Deal with left edge problem
x_lo = max(0, x_lo + x_start)
xlist = ylist[x_lo: x_start + x_hi]
for x, value in enumerate(xlist, x_lo):
if value:
yield (x_start, y_start), (x, y)
print(list(findpairs(example_array, 2.5)))
Execution time is going to be highly data dependent. For grins, I created arrays of the size you specified (13500 x 12000) to test timing. I used a bigger radius (3.0 instead of 2.0) and tried two cases: no matches, and every match. To avoid reallocating lists over and over, I simply ran the iterator and tossed the results. The code to do that is below. For a best-case (empty) array, it ran on my machine in 7 seconds; the time for the worst-case (all 1s) array was around 12 minutes.
def dummy(val):
onelist = 13500 * [val]
listolists = 12000 * [onelist]
for i in findpairs(listolists, 3.0):
pass
dummy(0)
dummy(1)
Upvotes: 2