Helk
Helk

Reputation: 121

Finding closest point

I have a dataframe of all_points and their coordinates:

all_points =
   point_id   latitude  longitude  
0          1  41.894577 -87.645307  
1          2  41.894647 -87.640426 
2          3  41.894713 -87.635513 
3          4  41.894768 -87.630629  
4          5  41.894830 -87.625793 

and a dataframe of the parent_points:

parent_pts = 
       parent_id
0       1             
1       2     

I want to create a column on the all_points dataframe with the closest parent point to each point.

This is my trial, but I might be making it more complicated:

from scipy.spatial.distance import cdist

def closest_point(point, points):
    """ Find closest point from a list of points. """
    return points[cdist([point], points).argmin()]

def match_value(df, col1, x, col2):
    """ Match value x from col1 row to value in col2. """
    return df[df[col1] == x][col2].values[0]

all_points['point'] = [(x, y) for x,y in zip(all_points['latitude'], all_points['longitude'])]
parent_pts['point'] = all_points['point'][all_points['point_id   '].isin(parent_pts['parent_id'])]

all_points['parent'] = [match_value(parent_pts, 'point', x, 'parent_id') for x in all_points['closest']]

The parent_point is a subset of the all_points.

I get this error when I try to use the closest_point function:

ValueError: XB must be a 2-dimensional array.

Upvotes: 1

Views: 680

Answers (2)

AGN Gazer
AGN Gazer

Reputation: 8388

First, let me start by saying that it appears to me that your longitudes and latitudes are locations on Earth. Assuming that Earth is a sphere, the distance between two points should be computed as the length along great-circle distance and not as Euclidean distance that you get using cdist.

The easiest approach from the programming point of view (except for the learning curve for you) is to use the astropy package. They have quite an OK documentation sometimes with useful examples, see, e.g., match_coordinates_sky() or catalog matching with astropy.

Then you might do something like this:

>>> from astropy.units import Quantity
>>> from astropy.coordinates import match_coordinates_sky, SkyCoord, EarthLocation
>>> from pandas import DataFrame
>>> import numpy as np
>>>
>>> # Create your data as I understood it:
>>> all_points = DataFrame({'point_id': np.arange(1,6), 'latitude': [41.894577, 41.894647, 41.894713, 41.894768, 41.894830], 'longitude': [-87.645307, -87.640426, -87.635513, -87.630629, -87.625793 ]})
>>> parent_pts = DataFrame({'parent_id': [1, 2]})
>>>
>>> # Create a frame with the coordinates of the "parent" points:
>>> parent_coord = all_points.loc[all_points['point_id'].isin(parent_pts['parent_id'])]
>>> print(parent_coord)
    latitude  longitude  point_id
0  41.894577 -87.645307         1
1  41.894647 -87.640426         2
>>>
>>> # Create coordinate array for "points" (in principle the below statements
>>> # could be combined into a single one):
>>> all_lon = Quantity(all_points['longitude'], unit='deg')
>>> all_lat = Quantity(all_points['latitude'], unit='deg')
>>> all_pts = SkyCoord(EarthLocation.from_geodetic(all_lon, all_lat).itrs, frame='itrs')
>>>
>>> # Create coordinate array for "parent points":
>>> parent_lon = Quantity(parent_coord['longitude'], unit='deg')
>>> parent_lat = Quantity(parent_coord['latitude'], unit='deg')
>>> parent_catalog = SkyCoord(EarthLocation.from_geodetic(parent_lon, parent_lat).itrs, frame='itrs')
>>> 
>>> # Get the indices (in parent_catalog) of parent coordinates
>>> # closest to each point:
>>> matched_indices = match_coordinates_sky(all_pts, parent_catalog)[0]
Downloading http://maia.usno.navy.mil/ser7/finals2000A.all
|========================================================================| 3.1M/3.1M (100.00%)         0s
>>> all_points['parent_id'] = [parent_pts['parent_id'][idx] for idx in matched_indices]
>>> print(all_points)
    latitude  longitude  point_id  parent_id
0  41.894577 -87.645307         1          1
1  41.894647 -87.640426         2          2
2  41.894713 -87.635513         3          2
3  41.894768 -87.630629         4          2
4  41.894830 -87.625793         5          2

I would like to add that match_coordinates_sky() returns not only matching indices but also a list of angular separations between the data point and the matched "parent" point as well as distance in meters between the data points and the matched "parent" point. It may be useful for your problem.

Upvotes: 2

DJK
DJK

Reputation: 9274

You could do this instead if you want to use euclidean distance and use the index as your point id instead

def findClose(inX,inY,cIndex,X,Y):
    X,Y = X - inX,Y-inY
    X,Y = X**2,Y**2
    dist = np.sqrt(np.sum([X, Y], axis=0))
    dist[cIndex] = np.max(dist)*100 # ensure not the current index
    return np.argmin(dist)

X,Y = all_points['latitude'].as_matrix(),all_points['longitude'].as_matrix()
all_points['point_id'] = all_points.index
all_points['Parents'] = all_points.apply(lambda row: 
                    findClose(row['latitude'],row['longitude'],
                    row['point_id'],X,Y),axis=1)

which yields

print all_points

   point_id   latitude  longitude  Parents
0         0  41.894577 -87.645307        1
1         1  41.894647 -87.640426        0
2         2  41.894713 -87.635513        3
3         3  41.894768 -87.630629        4
4         4  41.894830 -87.625793        3

Upvotes: 1

Related Questions