measure distance to nearest group of points - python

Question

I'm trying to measure the shortest euclidean distance of each point to the nearest group of points. Using below, I have 6 unique points displayed in x,y at two separate time points. I have a separate xy point recorded in x_ref, y_ref, which I pass a radius around. So for each point outside this radius, I want to find the shortest distance to any point within the radius. For points within the radius, just return 0.

calculate_distances measures the distance between each specific point and the remaining points. I'm hoping to return the distance to the nearest point within the radius.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform

df = pd.DataFrame({        
    'Time' : [1,1,1,1,1,1,2,2,2,2,2,2],                               
    'Item' : ['A','B','C','D','E','F','A','B','C','D','E','F'],      
    'x' : [5,5,8,3,6,2,6,7,4,2,7,6],
    'y' : [-2,0,-2,0,0,4,-1,2,-3,4,-4,2],     
    'x_ref' : [4,4,4,4,4,4,4,4,4,4,4,4],
    'y_ref' : [-2,-2,-2,-2,-2,-2,-2,-2,-2,-2,-2,-2],                          
    })

# Determine square distance
square_dist = (df['x_ref'] - df['x']) ** 2 + (df['y_ref'] - df['y']) ** 2
              
# Return df of items within radius
inside_radius = df[square_dist <= 3 ** 2].copy()

def calculate_distances(df):

    id_distances = pd.DataFrame(
        squareform(pdist(df[['x','y']].to_numpy())),  
        columns = df['Item'],
        index = df['Item'],
    )

    return id_distances

df_distances = df.groupby(['Time']).apply(calculate_distances).reset_index()

Intended output:

    Time Item  x  y  x_ref  y_ref  distance
0      1    A  5 -2      3     -2  0.000000 # within radius 0
1      1    B  5  0      3     -2  0.000000 # within radius 0
2      1    C  8 -2      3     -2  2.828427 # nearest within radius is E
3      1    D  3  0      3     -2  0.000000 # within radius 0
4      1    E  6  0      3     -2  0.000000 # within radius 0
5      1    F  2  4      3     -2  4.123106 # nearest within radius is D
6      2    A  6 -1      4     -2  0.000000 # within radius 0
7      2    B  7  2      4     -2  3.162278 # nearest within radius is A
8      2    C  4 -3      4     -2  0.000000 # within radius 0
9      2    D  2  4      4     -2  6.403124 # nearest within radius is A
10     2    E  7 -4      4     -2  3.162278 # nearest within radius is C or A
11     2    F  6  2      4     -2  3.000000 # nearest within radius is A

Pierre D · Accepted Answer

Here is a way to use scipy.spatial.KDTree, which is very useful when you intend to do many distance and neighbor searches.

import numpy as np
import pandas as pd
from scipy.spatial import KDTree

def within_radius_dist(z, radius, closed=False):
    center = z[['x_ref', 'y_ref']].mean()  # they should all be same
    z = z[['x', 'y']]
    dist_ubound = radius * 1.0001 if closed else radius
    dist, idx = KDTree(z).query(
        center, k=None, distance_upper_bound=dist_ubound)
    if closed:
        idx = [i for d, i in zip(dist, idx) if d <= radius]
    if idx:
        within = z.iloc[idx]
        dist, _ = KDTree(within).query(z)
    else:
        dist = np.nan
    return pd.Series(dist, index=z.index)

Application (here using your df as example):

>>> df.assign(distance=df.groupby('Time', group_keys=False).apply(
...     within_radius_dist, radius=3, closed=True))
    Time Item  x  y  x_ref  y_ref  distance
0      1    A  5 -2      3     -2  0.000000
1      1    B  5  0      3     -2  0.000000
2      1    C  8 -2      3     -2  3.000000
3      1    D  3  0      3     -2  0.000000
4      1    E  6  0      3     -2  1.000000
5      1    F  2  4      3     -2  4.123106
6      2    A  6 -1      4     -2  0.000000
7      2    B  7  2      4     -2  3.162278
8      2    C  4 -3      4     -2  0.000000
9      2    D  2  4      4     -2  6.403124
10     2    E  7 -4      4     -2  3.162278
11     2    F  6  2      4     -2  3.000000

Explanation:

The groupby('Time') makes sure we apply the function within_radius_dist() to each group by time.
Inside the function, the first KDTree query finds the points inside the sphere (circle, here, since this problem is 2D, but this can be generalized to nD) of given radius centered on (x_ref, y_ref).
Since the distance_upper_bound argument is exclusive (i.e. KDTree query returns only distances strictly smaller than this), in the case where we want to include points at the radius (when closed=True), then we need to do a bit of extra processing: add a small fraction to the radius, then clip.
Note also that, by default, the p=2 norm (Euclidean norm) is used, but you could use other norms as well.
within are these points inside the sphere.
(note: if there are no such point, we return NaN for all distances).
The second KDTree query looks for the nearest distance of all our points (within the group) to those within points. This conveniently returns 0 for points that are within the sphere (because that's the distance to themselves) and the distance to the nearest point within for the other points. So that's our result right there.
We return the result as a Series, so pandas knowns to put that in shape properly, and finally assign that to a column called 'distance'.

Last observation: the desired result provided in the original question seems to ignore x_ref, y_ref and use a single center=(4, -2). In the first group (Time == 1), the correct distance for C is 3.0 (distance to A) and E is not within the circle.

Supplement

If you are interested in also capturing which nearest neighbor is found for each point:

def within_radius_dist(z, radius, closed=False):
    center = z[['x_ref', 'y_ref']].mean()  # they should all be same
    z = z[['x', 'y']]
    dist_ubound = radius * 1.0001 if closed else radius
    dist, idx = KDTree(z).query(
        center, k=None, distance_upper_bound=dist_ubound)
    if closed:
        idx = [i for d, i in zip(dist, idx) if d <= radius]
    if idx:
        within = z.iloc[idx]
        dist, idx = KDTree(within).query(z)
        neigh_idx = within.index[idx]
    else:
        dist = np.nan
        neigh_idx = None
    return pd.DataFrame({'distance': dist, 'neighbor': neigh_idx}, index=z.index)

and then:

out = pd.concat([df, df.groupby('Time', group_keys=False).apply(
    within_radius_dist, radius=3, closed=True)], axis=1)
out.assign(neigh_item=out.loc[out.neighbor, 'Item'].values)

Output:

    Time Item  x  y  x_ref  y_ref  distance  neighbor neigh_item
0      1    A  5 -2      3     -2  0.000000         0          A
1      1    B  5  0      3     -2  0.000000         1          B
2      1    C  8 -2      3     -2  3.000000         0          A
3      1    D  3  0      3     -2  0.000000         3          D
4      1    E  6  0      3     -2  1.000000         1          B
5      1    F  2  4      3     -2  4.123106         3          D
6      2    A  6 -1      4     -2  0.000000         6          A
7      2    B  7  2      4     -2  3.162278         6          A
8      2    C  4 -3      4     -2  0.000000         8          C
9      2    D  2  4      4     -2  6.403124         6          A
10     2    E  7 -4      4     -2  3.162278         8          C
11     2    F  6  2      4     -2  3.000000         6          A

measure distance to nearest group of points - python

Answers (1)

Related Questions