Vainmonde De Courtenay
Vainmonde De Courtenay

Reputation: 376

List all points within distance to line

I have a list of coordinates a. I want to specify a radius r and then list all points on a 2D grid within the specified radius of any point in a, together with the minimum distance of each of those grid-points to any point in a. Since a is substantial (~1000-2000 points) I want to be as efficient as possible.

My implementation so far uses this code to list all coordinates within the given radius of a point; then iterates that over all coordinates in a; then flattens the output, takes a set (there will be MANY duplicates) - this is the set of coordinates within the specified radius of any point on the line - then calculates the minimum distances of that set to a using scipy.spatial.distance.cdist:

import numpy as np
np.random.seed(123)
a = np.random.randint(50, size=(100, 2))

def collect(coord, sigma: float =3.0):
    """ create a small collection of points in a neighborhood of some point 
    """
    neighborhood = []
    
    x=coord[0]
    y=coord[1]

    X = int(sigma)
    for i in range(-X, X + 1):
        Y = int(pow(sigma * sigma - i * i, 1/2))
        for j in range(-Y, Y + 1):
            neighborhood.append((x + i, y + j))

    return neighborhood

rad = 10
coord_list = [collect(coord, rad) for coord in a]
coord_set = np.array(list(set([val for sublist in coord_list for val in sublist])))

from scipy.spatial import distance

dists = distance.cdist(coord_set, a).min(axis=1)

Result:

coord_set
> array([[26, 21],
       [18, 17],
       [50,  6],
       ...,
       [14, 47],
       [15, 12],
       [ 7,  8]])

dists
> array([2.23606798, 5.65685425, 1.41421356, ..., 3.16227766, 3.        ,
       1.41421356])

Does anyone have any ways I can improve this and do it more efficiently?

Upvotes: 1

Views: 1430

Answers (2)

Pierre D
Pierre D

Reputation: 26201

You can adapt the answer I gave to your other linked question in a very straightforward way. The outcome is also very fast (~425 ms for 10K points in a).

Edit: For sparse cases (where the number of points actually filtered is a small portion of the complete grid), I also add a sparse_grid_within_radius() version below.

(Important: use scipy >= 1.6.0, where the Python implementation of KDTree has been replaced by cKDTree. See the release notes).

# please use scipy >= 1.6.0
from scipy.spatial import KDTree


def grid_within_radius(a, radius=10, resolution=(1,1), origin=(0,0)):
    pmin = (a - origin).min(axis=0) - radius
    pmax = (a - origin).max(axis=0) + radius
    imin = np.floor(pmin / resolution)
    imax = np.ceil(pmax / resolution) + 1
    xv, yv = np.meshgrid(np.arange(imin[0], imax[0]), np.arange(imin[1], imax[1]))
    grid = np.stack((xv, yv), axis=-1) * resolution + origin
    dist, _ = KDTree(a).query(grid, k=1, distance_upper_bound=radius + 0.01)
    idx = dist <= radius
    return grid[idx], dist[idx]

Usage

First, the OP's example:

np.random.seed(123)
a = np.random.randint(10, size=(100, 2))
g, d = grid_within_radius(a)

To compare with the OP's result, we need to sort their solution (coord_set, dists):

def sort2d(a, other=None):
    other = a if other is None else other
    return other[np.lexsort((a[:, 0], a[:, 1]))]

With this, we can check that our solution is the same:

>>> np.allclose(g, sort2d(coord_set))
True

>>> np.allclose(d, sort2d(coord_set, dists))
True

And another example (with a different grid resolution and radius):

g, d = grid_within_radius(a, radius=0.6, resolution=(.11, .47))
plt.scatter(*a.T, s=10, c='r')
plt.scatter(*g.T, s=1)

Speed

a = np.random.randint(1000, size=(10_000, 2))
%timeit grid_within_radius(a)
# 425 ms ± 528 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Sparse version

The above works well when the number of points returned is a significant portion of the total grid (e.g. 30% or more). But for very sparse cases (e.g. where a combination of the bounding box of a, the radius and the grid resolution result in generating a huge grid and then eliminate most of it), then it is slow. To illustrate, the following is a sparse case for a = np.random.randint(0, 200, (10, 2)):

The version below fixes that by generating "patches" of grid around quantized locations instead of the full grid.

import numpy as np
import pandas as pd
from scipy.spatial import KDTree


def unique_rows(a):
    # np.unique(a, axis=0) is slow, in part because it sorts;
    # using pandas.drop_duplicates() instead.
    # see https://github.com/numpy/numpy/issues/11136#issuecomment-822000680
    return pd.DataFrame(a).drop_duplicates().values

def sparse_grid_within_radius(a, radius=10, resolution=(1,1), origin=(0,0), rbox=1):
    resolution = np.array(resolution)
    box_size = radius * rbox
    nxy0 = np.floor(radius / resolution).astype(int)
    nxy1 = np.ceil((box_size + radius) / resolution).astype(int) + 1
    patch = np.stack(list(map(np.ravel, np.indices(nxy0 + nxy1))), axis=1) - nxy0
    
    ar = np.floor((a - origin) / box_size) * box_size
    ia = unique_rows(np.floor(ar / resolution).astype(int))
    grid = unique_rows((ia[:, None] + patch).reshape(-1, 2)) * resolution + origin

    dist, _ = KDTree(a).query(grid, k=1, distance_upper_bound=radius * 1.01)
    idx = dist <= radius
    return grid[idx], dist[idx]

With this, even very sparse results are fast.

Example

a = np.random.randint(0, 4000, (100, 2))

%timeit g, d = grid_within_radius(a)
# 3.88 s ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit gs, ds = sparse_grid_within_radius(a)
# 29.6 ms ± 24.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Speed comparison

import perfplot

perfplot.show(
    setup=lambda n: np.random.randint(0, n, (100, 2)),
    kernels=[grid_within_radius, sparse_grid_within_radius],
    n_range=[2 ** k for k in range(14)],
    equality_check=lambda a, b: np.allclose(sort2d(a[0]), sort2d(b[0])),
)

Upvotes: 3

Mad Physicist
Mad Physicist

Reputation: 114230

Let's inspect your implementation carefully. Notice that you are already poised and partially compute a distance metric in collect.

  1. What if you made neighborhood a dict instead of a list, keyed by grid point with a value of minimum distance. You could entirely eliminate the call to set and cdist.
  2. If a can contain float values, you should your range from int(coord[0] - rad) to int(coord[0] + rad) + 1. int(0.5 - 10) is -9, while int(0.5) - 10 is -10.
  3. You can compare to the squared radius, so you don't need to take the square root except once for the final result.

Points 2 and 3 are relatively minor improvements.

Here is an example:

rad = 10
rad2 = rad**2

points = {}

for coord in a:
    for x in range(int(np.ceil(coord[0] - rad)), int(coord[0] + rad) + 1):
        dx2 = (x - coord[0])**2
        chord = np.sqrt(rad2 - dx2)
        for y in range(int(np.ceil(coord[1] - chord)), int(coord[1] + chord) + 1):
            d2 = dx2 + (y - coord[1])**2
            key = x, y
            points[key] = min(d2, points.get(key, rad2))

To convert this into numpy arrays:

grids = np.array(list(points.keys()))
nearest = np.sqrt(np.fromiter(points.values(), count=len(points), dtype=float))

Upvotes: 0

Related Questions