Reputation: 78
I have two Pandas DataFrames, the first one is named Points with columns 'longitude' and 'latitude' (i.e. geocoordinates); and, the second data frame is named Links with these columns: 'lon1' & 'lat1' for the first point and 'lon2' & 'lat2' for the second point then every pair of points given in each row creates a link/line. Also, for the Links DataFrame, it has a column named 'link_id'.
Say, there are about 10 points and 4,000 links. How would I associate every given point to the closest link/line by returning the 'link_id' and appending it as an additional column named 'closest_link' to the Points DataFrame?
Upvotes: 0
Views: 258
Reputation: 4455
Here's an approach that may work. Consider:
Let's call the new df, PointsLinks.
Here's some code that takes this approach:
import pandas as pd
import random
Points = pd.DataFrame( [ [ 1,2 ], [ 3,4 ], [ 5,6 ] ], columns = [ 'longitude', 'latitude' ] )
Links = pd.DataFrame( [ [ 'Link1', ( 4,3 ) , ( -1, -2 ) ], [ 'Link2', (10,10) , ( -5, -5 ) ] ], columns = [ 'linkid', 'lon1&lat1', 'lon2&lat2' ] )
print(Points)
print(Links)
#Step 1: https://stackoverflow.com/questions/53699012/performant-cartesian-product-cross-join-with-pandas
def cartesian_product_basic(left, right):
return (
left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))
def DistanceToLink( pointlink ):
return random.randrange(10)
PointsLinks = cartesian_product_basic(Points,Links)
print( PointsLinks )
#Step 2: https://stackoverflow.com/questions/26886653/pandas-create-new-column-based-on-values-from-other-columns-apply-a-function-o
PointsLinks['distance'] = PointsLinks.apply( lambda row : DistanceToLink(row), axis = 'columns' )
print( PointsLinks )
#Step 3: Find the smallest distance per group https://stackoverflow.com/questions/27842613/pandas-groupby-sort-within-groups
closest = PointsLinks.sort_values( [ 'latitude', 'longitude', 'distance' ] , ascending = True ).groupby( [ 'latitude', 'longitude'] ).head(1)
# Drop the unnecessary columns
closest.drop( columns = ['lon1&lat1','lon2&lat2','distance'] , inplace=True)
print(closest)
Here are the dataframes created by the code:
Points:
longitude latitude
0 1 2
1 3 4
2 5 6
Links:
linkid lon1&lat1 lon2&lat2
0 Link1 (4, 3) (-1, -2)
1 Link2 (10, 10) (-5, -5)
and then PointsLinks (after adding the distance column with apply():
longitude latitude linkid lon1&lat1 lon2&lat2 distance
0 1 2 Link1 (4, 3) (-1, -2) 1
1 1 2 Link2 (10, 10) (-5, -5) 6
2 3 4 Link1 (4, 3) (-1, -2) 0
3 3 4 Link2 (10, 10) (-5, -5) 9
4 5 6 Link1 (4, 3) (-1, -2) 5
5 5 6 Link2 (10, 10) (-5, -5) 1
I didn't implement DistanceToLink
. I just put a random number generator there. Here's is what the first pointlink
object looks like ( it is a Series representing a row):
longitude 1
latitude 2
linkid Link1
lon1&lat1 (4, 3)
lon2&lat2 (-1, -2)
Now that you have the distance for each combination, you can find, and select, the PointLink pairs with the shortest distance (using pandas groupby sort within groups):
closest = PointsLinks.sort_values( [ 'latitude', 'longitude', 'distance' ] , ascending = True ).groupby( [ 'latitude', 'longitude'] ).head(1)
Here are the results:
longitude latitude linkid
0 1 2 Link1
2 3 4 Link1
5 5 6 Link2
Upvotes: 2