Reputation: 6488
I have two DataFrames containing Lat and Lon columns. For each (Lat, Lon) pair in the first DataFrame, I want to find the distance to ALL (Lat, Lon) pairs in the second DataFrame and take the minimum. The package I am using is geopy. The code is as follows:
from geopy import distance
import numpy as np

distanceMiles = []
count = 0
for id1, row1 in df1.iterrows():
    target = (row1["LAT"], row1["LON"])
    count = count + 1
    print(count)
    for id2, row2 in df2.iterrows():
        point = (row2["LAT"], row2["LON"])
        distanceMiles.append(distance.distance(target, point).miles)
    closestPoint = np.argmin(distanceMiles)
    distanceMiles = []
The problem is that df1 has 168K rows and df2 has 1200 rows. How do I make it faster?
Upvotes: 1
Views: 2874
Reputation: 2072
Leaving this here in case anyone needs it in the future:
If you need only the minimum distance, then you don't have to brute-force all the pairs. There are data structures that can solve this in O(n log n) time, which is far faster than the brute-force method.
For example, you can run a generalized k-nearest-neighbors query (with k=1) to do exactly that, as long as you account for your points lying on a sphere rather than a plane. See this SO answer for an example implementation using sklearn.
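As an illustration, here is a minimal sketch of that idea using scikit-learn's BallTree with the haversine metric (the LAT/LON column names are carried over from the question; the Earth-radius constant used for the mile conversion is an assumption):
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius in miles

# BallTree with the haversine metric expects (lat, lon) pairs in radians
tree = BallTree(np.radians(df2[["LAT", "LON"]].to_numpy()), metric="haversine")

# For every row of df1, query its single nearest neighbor (k=1) in df2
dist_rad, idx = tree.query(np.radians(df1[["LAT", "LON"]].to_numpy()), k=1)

closest_idx = idx[:, 0]  # positional index into df2
closest_miles = dist_rad[:, 0] * EARTH_RADIUS_MILES  # haversine distances come back in radians
Keep in mind that haversine assumes a spherical Earth, so the results will differ slightly from geopy's geodesic distances.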
There seem to be a few libraries that solve this too, such as sknni and GriSPy.
Here's also another question that talks a bit about the theory.
Upvotes: 1
Reputation: 898
geopy.distance.distance uses the geodesic algorithm by default, which is rather slow but more accurate. If you can trade accuracy for speed, you can use great_circle, which is ~20 times faster (newport_ri and cleveland_oh below are two sample (lat, lon) tuples):
In [4]: %%timeit
...: distance.distance(newport_ri, cleveland_oh).miles
...:
236 µs ± 1.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [5]: %%timeit
...: distance.great_circle(newport_ri, cleveland_oh).miles
...:
13.4 µs ± 94.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
You can also use multiprocessing to parallelize the computation:
from multiprocessing import Pool
from geopy import distance
import numpy as np

def compute(points):
    target, point = points
    return distance.great_circle(target, point).miles

with Pool() as pool:
    for id1, row1 in df1.iterrows():
        target = (row1["LAT"], row1["LON"])
        distanceMiles = pool.map(
            compute,
            (
                (target, (row2["LAT"], row2["LON"]))
                for id2, row2 in df2.iterrows()
            ),
        )
        closestPoint = np.argmin(distanceMiles)
Upvotes: 6
Reputation: 7361
If you need to check all the pairs by brute force, I think the following approach is the best you can do. Looping directly over columns is usually slightly faster than iterrows, and replacing the inner loop with a vectorized apply saves time too.
count = 0
for lat1, lon1 in zip(df1["LAT"], df1["LON"]):
    target = (lat1, lon1)
    count = count + 1
    # print(count)  # printing is also time expensive
    df2['dist'] = df2.apply(lambda row: distance.distance(target, (row['LAT'], row['LON'])).miles, axis=1)
    closestpoint = df2['dist'].min()     # if you want the minimum distance
    closestpoint = df2['dist'].idxmin()  # if you want the position (index) of the minimum
Upvotes: 0
Reputation: 19307
This should run much faster if you use itertools instead of explicit for loops. The inline comments explain what's happening at each step.
import numpy as np
import pandas as pd
import itertools
from geopy import distance

# Create 2 sample dataframes with 10 and 5 rows of lat, lon columns respectively
df1 = pd.DataFrame({'LAT': np.random.random(10,), 'LON': np.random.random(10,)})
df2 = pd.DataFrame({'LAT': np.random.random(5,), 'LON': np.random.random(5,)})

# Zip the 2 columns to get (lat, lon) tuples for targets in df1 and points in df2
target = list(zip(df1['LAT'], df1['LON']))
point = list(zip(df2['LAT'], df2['LON']))

# itertools.product does a cross product between the 2 iterables.
# You get tuples of the form ((lat, lon), (lat, lon)) where the 1st is the target
# and the 2nd is the point. Feel free to change the order if needed.
product = list(itertools.product(target, point))

# starmap(function, iterable) maps the distance function over the list of tuples;
# .miles then converts each result to miles
geo_dist = [i.miles for i in itertools.starmap(distance.distance, product)]
len(geo_dist)
50
geo_dist = [42.430772028845716,
44.29982320107605,
25.88823239877388,
23.877570442142783,
29.9351451072828,
...]
Finally, if you are working with a massive dataset, I would recommend using the multiprocessing library to distribute the itertools.starmap calls across cores and compute the distance values in parallel; multiprocessing.Pool supports starmap directly.
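Here is a minimal sketch of that idea, reusing the target and point lists built above (the worker name geo_miles is made up for illustration, and great_circle is swapped in for speed, as suggested in another answer). Note that with 168K × 1200 rows the cross product holds roughly 200 million pairs, so memory use is worth watching:
from multiprocessing import Pool
import itertools
from geopy import distance

def geo_miles(target, point):
    # Pool.starmap unpacks each ((lat, lon), (lat, lon)) tuple into two arguments
    return distance.great_circle(target, point).miles

if __name__ == "__main__":
    # target and point are the (lat, lon) lists built above
    with Pool() as pool:
        geo_dist = pool.starmap(geo_miles, itertools.product(target, point))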
Upvotes: 0