Reputation: 121
I have a dataframe of all_points and their coordinates:
all_points =
point_id latitude longitude
0 1 41.894577 -87.645307
1 2 41.894647 -87.640426
2 3 41.894713 -87.635513
3 4 41.894768 -87.630629
4 5 41.894830 -87.625793
and a dataframe of the parent_points:
parent_pts =
parent_id
0 1
1 2
I want to create a column on the all_points dataframe with the closest parent point to each point.
This is my trial, but I might be making it more complicated:
from scipy.spatial.distance import cdist
def closest_point(point, points):
""" Find closest point from a list of points. """
return points[cdist([point], points).argmin()]
def match_value(df, col1, x, col2):
""" Match value x from col1 row to value in col2. """
return df[df[col1] == x][col2].values[0]
all_points['point'] = [(x, y) for x,y in zip(all_points['latitude'], all_points['longitude'])]
parent_pts['point'] = all_points['point'][all_points['point_id '].isin(parent_pts['parent_id'])]
all_points['parent'] = [match_value(parent_pts, 'point', x, 'parent_id') for x in all_points['closest']]
The parent_point is a subset of the all_points.
I get this error when I try to use the closest_point function:
ValueError: XB must be a 2-dimensional array.
Upvotes: 1
Views: 680
Reputation: 8388
First, let me start by saying that it appears to me that your longitudes and latitudes are locations on Earth. Assuming that Earth is a sphere, the distance between two points should be computed as the length along great-circle distance and not as Euclidean distance that you get using cdist
.
The easiest approach from the programming point of view (except for the learning curve for you) is to use the astropy
package. They have quite an OK documentation sometimes with useful examples, see, e.g., match_coordinates_sky()
or catalog matching with astropy.
Then you might do something like this:
>>> from astropy.units import Quantity
>>> from astropy.coordinates import match_coordinates_sky, SkyCoord, EarthLocation
>>> from pandas import DataFrame
>>> import numpy as np
>>>
>>> # Create your data as I understood it:
>>> all_points = DataFrame({'point_id': np.arange(1,6), 'latitude': [41.894577, 41.894647, 41.894713, 41.894768, 41.894830], 'longitude': [-87.645307, -87.640426, -87.635513, -87.630629, -87.625793 ]})
>>> parent_pts = DataFrame({'parent_id': [1, 2]})
>>>
>>> # Create a frame with the coordinates of the "parent" points:
>>> parent_coord = all_points.loc[all_points['point_id'].isin(parent_pts['parent_id'])]
>>> print(parent_coord)
latitude longitude point_id
0 41.894577 -87.645307 1
1 41.894647 -87.640426 2
>>>
>>> # Create coordinate array for "points" (in principle the below statements
>>> # could be combined into a single one):
>>> all_lon = Quantity(all_points['longitude'], unit='deg')
>>> all_lat = Quantity(all_points['latitude'], unit='deg')
>>> all_pts = SkyCoord(EarthLocation.from_geodetic(all_lon, all_lat).itrs, frame='itrs')
>>>
>>> # Create coordinate array for "parent points":
>>> parent_lon = Quantity(parent_coord['longitude'], unit='deg')
>>> parent_lat = Quantity(parent_coord['latitude'], unit='deg')
>>> parent_catalog = SkyCoord(EarthLocation.from_geodetic(parent_lon, parent_lat).itrs, frame='itrs')
>>>
>>> # Get the indices (in parent_catalog) of parent coordinates
>>> # closest to each point:
>>> matched_indices = match_coordinates_sky(all_pts, parent_catalog)[0]
Downloading http://maia.usno.navy.mil/ser7/finals2000A.all
|========================================================================| 3.1M/3.1M (100.00%) 0s
>>> all_points['parent_id'] = [parent_pts['parent_id'][idx] for idx in matched_indices]
>>> print(all_points)
latitude longitude point_id parent_id
0 41.894577 -87.645307 1 1
1 41.894647 -87.640426 2 2
2 41.894713 -87.635513 3 2
3 41.894768 -87.630629 4 2
4 41.894830 -87.625793 5 2
I would like to add that match_coordinates_sky()
returns not only matching indices but also a list of angular separations between the data point and the matched "parent" point as well as distance in meters between the data points and the matched "parent" point. It may be useful for your problem.
Upvotes: 2
Reputation: 9274
You could do this instead if you want to use euclidean distance and use the index as your point id instead
def findClose(inX,inY,cIndex,X,Y):
X,Y = X - inX,Y-inY
X,Y = X**2,Y**2
dist = np.sqrt(np.sum([X, Y], axis=0))
dist[cIndex] = np.max(dist)*100 # ensure not the current index
return np.argmin(dist)
X,Y = all_points['latitude'].as_matrix(),all_points['longitude'].as_matrix()
all_points['point_id'] = all_points.index
all_points['Parents'] = all_points.apply(lambda row:
findClose(row['latitude'],row['longitude'],
row['point_id'],X,Y),axis=1)
which yields
print all_points
point_id latitude longitude Parents
0 0 41.894577 -87.645307 1
1 1 41.894647 -87.640426 0
2 2 41.894713 -87.635513 3
3 3 41.894768 -87.630629 4
4 4 41.894830 -87.625793 3
Upvotes: 1