Reputation: 368
I'm trying to measure the shortest euclidean distance of each point to the nearest group of points. Using below, I have 6 unique points displayed in x,y
at two separate time points. I have a separate xy point recorded in x_ref, y_ref
, which I pass a radius around. So for each point outside this radius, I want to find the shortest distance to any point within the radius. For points within the radius, just return 0.
calculate_distances
measures the distance between each specific point and the remaining points. I'm hoping to return the distance to the nearest point within the radius.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
df = pd.DataFrame({
'Time' : [1,1,1,1,1,1,2,2,2,2,2,2],
'Item' : ['A','B','C','D','E','F','A','B','C','D','E','F'],
'x' : [5,5,8,3,6,2,6,7,4,2,7,6],
'y' : [-2,0,-2,0,0,4,-1,2,-3,4,-4,2],
'x_ref' : [4,4,4,4,4,4,4,4,4,4,4,4],
'y_ref' : [-2,-2,-2,-2,-2,-2,-2,-2,-2,-2,-2,-2],
})
# Determine square distance
square_dist = (df['x_ref'] - df['x']) ** 2 + (df['y_ref'] - df['y']) ** 2
# Return df of items within radius
inside_radius = df[square_dist <= 3 ** 2].copy()
def calculate_distances(df):
id_distances = pd.DataFrame(
squareform(pdist(df[['x','y']].to_numpy())),
columns = df['Item'],
index = df['Item'],
)
return id_distances
df_distances = df.groupby(['Time']).apply(calculate_distances).reset_index()
Intended output:
Time Item x y x_ref y_ref distance
0 1 A 5 -2 3 -2 0.000000 # within radius 0
1 1 B 5 0 3 -2 0.000000 # within radius 0
2 1 C 8 -2 3 -2 2.828427 # nearest within radius is E
3 1 D 3 0 3 -2 0.000000 # within radius 0
4 1 E 6 0 3 -2 0.000000 # within radius 0
5 1 F 2 4 3 -2 4.123106 # nearest within radius is D
6 2 A 6 -1 4 -2 0.000000 # within radius 0
7 2 B 7 2 4 -2 3.162278 # nearest within radius is A
8 2 C 4 -3 4 -2 0.000000 # within radius 0
9 2 D 2 4 4 -2 6.403124 # nearest within radius is A
10 2 E 7 -4 4 -2 3.162278 # nearest within radius is C or A
11 2 F 6 2 4 -2 3.000000 # nearest within radius is A
Upvotes: 3
Views: 560
Reputation: 26201
Here is a way to use scipy.spatial.KDTree
, which is very useful when you intend to do many distance and neighbor searches.
import numpy as np
import pandas as pd
from scipy.spatial import KDTree
def within_radius_dist(z, radius, closed=False):
center = z[['x_ref', 'y_ref']].mean() # they should all be same
z = z[['x', 'y']]
dist_ubound = radius * 1.0001 if closed else radius
dist, idx = KDTree(z).query(
center, k=None, distance_upper_bound=dist_ubound)
if closed:
idx = [i for d, i in zip(dist, idx) if d <= radius]
if idx:
within = z.iloc[idx]
dist, _ = KDTree(within).query(z)
else:
dist = np.nan
return pd.Series(dist, index=z.index)
Application (here using your df
as example):
>>> df.assign(distance=df.groupby('Time', group_keys=False).apply(
... within_radius_dist, radius=3, closed=True))
Time Item x y x_ref y_ref distance
0 1 A 5 -2 3 -2 0.000000
1 1 B 5 0 3 -2 0.000000
2 1 C 8 -2 3 -2 3.000000
3 1 D 3 0 3 -2 0.000000
4 1 E 6 0 3 -2 1.000000
5 1 F 2 4 3 -2 4.123106
6 2 A 6 -1 4 -2 0.000000
7 2 B 7 2 4 -2 3.162278
8 2 C 4 -3 4 -2 0.000000
9 2 D 2 4 4 -2 6.403124
10 2 E 7 -4 4 -2 3.162278
11 2 F 6 2 4 -2 3.000000
Explanation:
groupby('Time')
makes sure we apply the function within_radius_dist()
to each group by time.KDTree
query finds the points inside the sphere (circle, here, since this problem is 2D, but this can be generalized to nD) of given radius centered on (x_ref, y_ref)
.distance_upper_bound
argument is exclusive (i.e. KDTree
query returns only distances strictly smaller than this), in the case where we want to include points at the radius (when closed=True
), then we need to do a bit of extra processing: add a small fraction to the radius, then clip.p=2
norm (Euclidean norm) is used, but you could use other norms as well.within
are these points inside the sphere.NaN
for all distances).KDTree
query looks for the nearest distance of all our points (within the group) to those within
points. This conveniently returns 0
for points that are within the sphere (because that's the distance to themselves) and the distance to the nearest point within for the other points. So that's our result right there.Series
, so pandas knowns to put that in shape properly, and finally assign that to a column called 'distance'
.Last observation: the desired result provided in the original question seems to ignore x_ref, y_ref
and use a single center=(4, -2)
. In the first group (Time == 1
), the correct distance for C
is 3.0 (distance to A) and E
is not within the circle.
Supplement
If you are interested in also capturing which nearest neighbor is found for each point:
def within_radius_dist(z, radius, closed=False):
center = z[['x_ref', 'y_ref']].mean() # they should all be same
z = z[['x', 'y']]
dist_ubound = radius * 1.0001 if closed else radius
dist, idx = KDTree(z).query(
center, k=None, distance_upper_bound=dist_ubound)
if closed:
idx = [i for d, i in zip(dist, idx) if d <= radius]
if idx:
within = z.iloc[idx]
dist, idx = KDTree(within).query(z)
neigh_idx = within.index[idx]
else:
dist = np.nan
neigh_idx = None
return pd.DataFrame({'distance': dist, 'neighbor': neigh_idx}, index=z.index)
and then:
out = pd.concat([df, df.groupby('Time', group_keys=False).apply(
within_radius_dist, radius=3, closed=True)], axis=1)
out.assign(neigh_item=out.loc[out.neighbor, 'Item'].values)
Output:
Time Item x y x_ref y_ref distance neighbor neigh_item
0 1 A 5 -2 3 -2 0.000000 0 A
1 1 B 5 0 3 -2 0.000000 1 B
2 1 C 8 -2 3 -2 3.000000 0 A
3 1 D 3 0 3 -2 0.000000 3 D
4 1 E 6 0 3 -2 1.000000 1 B
5 1 F 2 4 3 -2 4.123106 3 D
6 2 A 6 -1 4 -2 0.000000 6 A
7 2 B 7 2 4 -2 3.162278 6 A
8 2 C 4 -3 4 -2 0.000000 8 C
9 2 D 2 4 4 -2 6.403124 6 A
10 2 E 7 -4 4 -2 3.162278 8 C
11 2 F 6 2 4 -2 3.000000 6 A
Upvotes: 3