chintan s

Reputation: 6488

Speed up getting distance between two lat and lon

I have two DataFrames containing Lat and Lon columns. For each (Lat, Lon) pair in one DataFrame, I want to find the distance to ALL (Lat, Lon) pairs in the other DataFrame and take the minimum. The package I am using is geopy. The code is as follows:

from geopy import distance
import numpy as np

distanceMiles = []
count = 0
for id1, row1 in df1.iterrows():
    target = (row1["LAT"], row1["LON"])
    count = count + 1
    print(count)
    for id2, row2 in df2.iterrows():
        point = (row2["LAT"], row2["LON"])
        distanceMiles.append(distance.distance(target, point).miles)

    closestPoint = np.argmin(distanceMiles)
    distanceMiles = []

The problem is that df1 has 168K rows and df2 has 1200 rows. How do I make it faster?

Upvotes: 1

Views: 2874

Answers (4)

Qusai Alothman

Reputation: 2072

Leaving this here in case anyone needs it in the future:

If you need only the minimum distance, then you don't have to brute-force all the pairs. There are data structures that can solve this in O(n log n) time, which is way faster than the brute-force method.

For example, you can use a generalized K-nearest-neighbors (with k=1) algorithm to do exactly that, as long as you account for your points lying on a sphere rather than a plane. See this SO answer for an example implementation using sklearn.

There also seem to be a few libraries dedicated to this, like sknni and GriSPy.

Here's also another question that talks a bit about the theory.
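As a concrete illustration of the k=1 nearest-neighbor idea, here is a minimal sketch using sklearn's `BallTree` with the haversine metric. The frame sizes and the Earth radius of 3958.8 miles are assumptions for the example, not values from the question:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

# Hypothetical sample frames standing in for df1 (168K rows) and df2 (1200 rows)
rng = np.random.default_rng(0)
df1 = pd.DataFrame({'LAT': rng.uniform(-90, 90, 168), 'LON': rng.uniform(-180, 180, 168)})
df2 = pd.DataFrame({'LAT': rng.uniform(-90, 90, 12), 'LON': rng.uniform(-180, 180, 12)})

# BallTree with the haversine metric expects (lat, lon) pairs in radians
tree = BallTree(np.radians(df2[['LAT', 'LON']].values), metric='haversine')

# Query the single nearest df2 point (k=1) for every df1 point at once
dist_rad, idx = tree.query(np.radians(df1[['LAT', 'LON']].values), k=1)

# Haversine distances come back in radians; scale by Earth's radius for miles
dist_miles = dist_rad[:, 0] * 3958.8
nearest_idx = idx[:, 0]  # index into df2 of the closest point for each df1 row
```

The tree is built once over the smaller frame and each query is logarithmic, so the whole pass avoids the 168K × 1200 pairwise loop entirely.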

Upvotes: 1

KostyaEsmukov

Reputation: 898

geopy.distance.distance uses the geodesic algorithm by default, which is rather slow but more accurate. If you can trade accuracy for speed, you can use great_circle, which is ~20 times faster:

In [4]: %%timeit
   ...: distance.distance(newport_ri, cleveland_oh).miles
   ...:
236 µs ± 1.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %%timeit
   ...: distance.great_circle(newport_ri, cleveland_oh).miles
   ...:
13.4 µs ± 94.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

You may also use multiprocessing to parallelize the computation:

from multiprocessing import Pool
from geopy import distance
import numpy as np


def compute(points):
    target, point = points
    return distance.great_circle(target, point).miles


with Pool() as pool:
    for id1, row1 in df1.iterrows():
        target = (row1["LAT"], row1["LON"])
        distanceMiles = pool.map(
            compute,
            (
                (target, (row2["LAT"], row2["LON"]))
                for id2, row2 in df2.iterrows()
            )
        )
        closestPoint = np.argmin(distanceMiles)
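If even per-pair great_circle calls are too slow, the great-circle formula itself can be vectorized with NumPy, so each target is compared against all of df2 in a single array operation. A minimal sketch, assuming a spherical Earth with radius ≈ 3958.8 miles (so results approximate great_circle, not geodesic):

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8

def haversine_miles(lat1, lon1, lat2, lon2):
    """Vectorized great-circle distance; inputs in degrees.

    lat2/lon2 may be NumPy arrays, in which case an array of
    distances (one per element) is returned.
    """
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

# Usage against a DataFrame (df2 assumed as in the question):
# dists = haversine_miles(target[0], target[1], df2['LAT'].values, df2['LON'].values)
# closestPoint = dists.argmin()
```

This drops both the inner Python loop and the per-call overhead, at the cost of the geodesic algorithm's extra accuracy.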

Upvotes: 6

Valentino

Reputation: 7361

If you need to check all the pairs by brute force, I think the following approach is the best you can do.
Looping directly over columns is usually slightly faster than iterrows, and replacing the inner loop with apply saves time too.

for lat1, lon1 in zip(df1["LAT"], df1["LON"]):
    target = (lat1, lon1)
    # printing a progress counter here is also time-expensive, so it is dropped
    df2['dist'] = df2.apply(lambda row: distance.distance(target, (row['LAT'], row['LON'])).miles, axis=1)
    closestpoint = df2['dist'].min()    #if you want the minimum distance
    closestpoint = df2['dist'].idxmin() #if you want the position (index) of the minimum

Upvotes: 0

Akshay Sehgal

Reputation: 19307

This should run much faster if you utilize itertools instead of explicit for loops. Inline comments should help you understand what's happening at each step.

import numpy as np
import pandas as pd
import itertools
from geopy import distance


#Creating 2 sample dataframes with 10 and 5 rows of lat, long columns respectively
df1 = pd.DataFrame({'LAT':np.random.random(10,), 'LON':np.random.random(10,)})
df2 = pd.DataFrame({'LAT':np.random.random(5,), 'LON':np.random.random(5,)})


#Zip the 2 columns to get (lat, lon) tuples for target in df1 and point in df2
target = list(zip(df1['LAT'], df1['LON']))
point = list(zip(df2['LAT'], df2['LON']))


#The product function in itertools computes a cross product between the 2 iterables
#You should get tuples of the form ((lat, lon), (lat, lon)) where the 1st is target, the 2nd is point. Feel free to change the order if needed
product = list(itertools.product(target, point))

#starmap(function, parameters) maps the distance function over the list of tuples. i.miles then converts each result to miles
geo_dist = [i.miles for i in itertools.starmap(distance.distance, product)]
len(geo_dist)
# 50

geo_dist[:5]
# [42.430772028845716,
#  44.29982320107605,
#  25.88823239877388,
#  23.877570442142783,
#  29.9351451072828]

Finally, if you are working with a massive dataset, then I would recommend using the multiprocessing library to distribute the starmap computation across cores and compute the distance values in parallel. Python's multiprocessing.Pool now supports starmap directly.

Upvotes: 0
