christaylor

Reputation: 391

How to find the distance between 2 points in 2 different dataframes in pandas?

I've got two dataframes, each with a set of coordinates. Dataframe 1 is a list of biomass sites, with coordinates in columns 'lat' and 'lng'. Dataframe 2 is a list of postcode coordinates, linked to sale price, with coordinates in columns 'pc_lat' and 'pc_lng'.

I've used this Stack Overflow question to work out the closest biomass site to each property. This is the code I am using:

def dist(lat1, long1, lat2, long2):
    return np.abs((lat1-lat2)+(long1-long2))

def find_site(lat, long):
    distances = biomass.apply(
        lambda row: dist(lat, long, row['lat'], row['lng']), 
        axis=1)
    return biomass.loc[distances.idxmin(),'Site Name']

hp1995['BiomassSite'] = hp1995.apply(
    lambda row: find_site(row['pc_lat'], row['pc_long']), 
    axis=1)

print(hp1995.head())

This has worked well, in that I've got the name of the closest biomass generation site. However, I also want to know the distance calculated between the two points.

  1. How would I calculate the distance?

  2. What units would the output distance be in? I am trying to find properties within 2 km of a biomass site.

Upvotes: 1

Views: 2075

Answers (1)

DarkCygnus

Reputation: 7838

To calculate the distance between two points given as global coordinates, you should use the Haversine formula. Based on this page, I implemented the following function:

import math
def distanceBetweenCm(lat1, lon1, lat2, lon2):
    dLat = math.radians(lat2-lat1)
    dLon = math.radians(lon2-lon1)

    lat1 = math.radians(lat1)
    lat2 = math.radians(lat2)

    a = math.sin(dLat/2) * math.sin(dLat/2) + math.sin(dLon/2) * math.sin(dLon/2) * math.cos(lat1) * math.cos(lat2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    return c * 6371 * 100000 #multiply by 100k to get distance in cm

You can change the units of the result by changing the multiplier. In the example, multiplying by 100,000 (100k) gives centimeters; multiplying by 1,000 gives meters; with no multiplier the function returns kilometers. From there you can perform further unit conversions if necessary.
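To tie the unit conversion back to the 2 km check from the question, here is a minimal sketch of the same formula with a selectable unit. The `haversine` name, the `unit` parameter, the `UNIT_FACTORS` table, and the coordinates are all hypothetical and only for illustration; they are not part of the original answer.

```python
import math

EARTH_RADIUS_KM = 6371  # same hardcoded mean radius as the function above

# Multipliers from kilometers to other units (illustrative helper).
UNIT_FACTORS = {"km": 1, "m": 1_000, "cm": 100_000}


def haversine(lat1, lon1, lat2, lon2, unit="km"):
    """Great-circle distance between two points given in decimal degrees."""
    d_lat = math.radians(lat2 - lat1)
    d_lon = math.radians(lon2 - lon1)
    lat1 = math.radians(lat1)
    lat2 = math.radians(lat2)
    a = (math.sin(d_lat / 2) ** 2
         + math.sin(d_lon / 2) ** 2 * math.cos(lat1) * math.cos(lat2))
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return c * EARTH_RADIUS_KM * UNIT_FACTORS[unit]


# Hypothetical coordinates: two points 0.01 degrees of latitude apart.
site = (52.00, -1.00)
prop = (52.01, -1.00)

dist_km = haversine(*prop, *site)   # kilometers by default
within_2km = dist_km <= 2           # the threshold asked about in the question
print(f"{dist_km:.3f} km, within 2 km: {within_2km}")
```

In the question's setup, the same function could be called inside `find_site` in place of `dist`, so that `distances.min()` gives the distance to the nearest site in kilometers.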

Edit: As suggested in the comments, one possible optimization for this would be using power operators instead of regular multiplication, like this:

a = math.sin(dLat/2)**2 + math.sin(dLon/2)**2 * math.cos(lat1) * math.cos(lat2)

Take a look at this question to read more about the relative speed of different ways of calculating powers in Python.


Edit: Some years later I needed to calculate haversine distances between lat/lon points again. This answer was still useful to me, as it calculates the correct distance* and needs no external libraries.

However, if we get down to the details, the algorithm I provided hardcodes Earth's radius as 6371 km and does not account for the fact that the radius is not uniform (spoiler alert: it's smaller closer to the poles and bigger closer to the equator).

In most cases we can probably live with this: it introduces a small approximation error that grows the further the "real" radius at your input points is from the hardcoded value. (When I compared against a more precise source, the worst-case error was below 2 meters, and in most cases it was under a meter.)

An alternative is to use a library that uses a geodesic model to account for the real radius. One such library I found is geopy, specifically the geopy.distance.geodesic() method. This, however, introduces an external dependency as a tradeoff.
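As a rough sketch of the geopy route, assuming the package is installed (`pip install geopy`), the call could look like this. The coordinates are hypothetical, and the `try`/`except` guard is only there so the snippet still runs when geopy is missing.

```python
# Assumes geopy is installed; it is not part of the standard library.
try:
    from geopy.distance import geodesic
except ImportError:
    geodesic = None

# Hypothetical (lat, lon) pairs in decimal degrees.
site = (52.00, -1.00)
prop = (52.01, -1.00)

if geodesic is not None:
    # geodesic() takes two (lat, lon) tuples and returns a Distance object;
    # .km and .m expose the value in kilometers and meters.
    dist_km = geodesic(prop, site).km
    print(f"{dist_km:.3f} km")
else:
    dist_km = None
    print("geopy is not available")
```

Note that `geodesic()` expects each point as a `(lat, lon)` tuple rather than four separate arguments, so rows from a dataframe would need to be packed accordingly.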

Upvotes: 3
