Reputation: 637
I am currently trying some geocoding in Python. The process is the following: I have two data frames (df1 and df2, houses and schools) with latitude and longitude values and want to find the nearest neighbour in df2 for every observation in df1. I use the following code:
from tqdm import tqdm
import numpy as np
import pandas as pd
import math

def distance(lat1, long1, lat2, long2):
    R = 6371  # Earth Radius in Km
    dLat = math.radians(lat2 - lat1)  # Convert Degrees 2 Radians
    dLong = math.radians(long2 - long1)
    lat1 = math.radians(lat1)
    lat2 = math.radians(lat2)
    a = math.sin(dLat/2) * math.sin(dLat/2) + math.sin(dLong/2) * math.sin(dLong/2) * math.cos(lat1) * math.cos(lat2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    d = R * c
    return d

dists = []
schools = []
for index, row1 in tqdm(df1.iterrows()):
    for index, row2 in df2.iterrows():
        dists.append(distance(row1.lat, row1.lng, row2.Latitude, row2.Longitude))
    schools.append(min(dists))
    del dists[:]

df1["school"] = pd.Series(schools)
The code works, but it takes ages. With tqdm I get an average speed of 2 iterations of df1 per second. As a comparison, I did the whole task in STATA with geonear and it takes 1 second for all observations in df1 (950). I read in the help file of geonear that it uses clustering so that it does not calculate all distances, but only the closest ones. However, before I add a clustering function (which might also take CPU power), I wonder if someone sees a way to speed up the process as it is (I am new to Python and might have some inefficient code that slows the process down). Or is there maybe a package that does this faster?
I would be OK with it taking longer than in STATA, but not nearly 7 minutes...
Thank you in advance
Upvotes: 4
Views: 564
Reputation: 61289
The way you're doing this is slow because you are using an O(n²) algorithm: each row looks at every other row. Georgy's answer, while introducing vectorization, does not solve this fundamental inefficiency.
I'd recommend loading your data points into a kd-tree: this data structure provides a fast way of finding nearest neighbours in multiple dimensions. Construction of such a tree takes O(n log n) time and each query takes O(log n), so querying a nearest neighbour for all n points takes O(n log n) in total.
If your data is localized to a geographic region that can be well approximated by a plane, project your data and then perform the lookup in two dimensions. Otherwise, if your data is globally dispersed, project it into spherical Cartesian (x, y, z) coordinates and perform the lookup there.
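For the planar case, a minimal sketch might look like the following. Treat it as an untested illustration: it assumes the lat/lng and Latitude/Longitude column names from your question and uses a naive equirectangular projection, which is only reasonable over a small region.

import numpy as np
from scipy.spatial import cKDTree

Rearth = 6371  #km

def ProjectFlat(lon, lat, lat0):
    #Naive equirectangular projection about the reference latitude lat0 (radians)
    lon, lat = np.radians(lon), np.radians(lat)
    return np.column_stack((Rearth*lon*np.cos(lat0), Rearth*lat))

lat0    = np.radians(df1['lat'].mean())  #reference latitude for the projection
houses  = ProjectFlat(df1['lng'].to_numpy(),       df1['lat'].to_numpy(),      lat0)
schools = ProjectFlat(df2['Longitude'].to_numpy(), df2['Latitude'].to_numpy(), lat0)

#Approximate distance (km) to, and index of, the nearest school for every house
dist_km, idx = cKDTree(schools).query(houses)
df1['school'] = dist_km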
An example of how you might handle the globally-dispersed (spherical) case appears as follows:
#!/usr/bin/env python3

import numpy as np
import scipy as sp
import scipy.spatial

Rearth = 6371

#Generate uniformly-distributed lon-lat points on a sphere
#See: http://mathworld.wolfram.com/SpherePointPicking.html
def GenerateUniformSpherical(num):
    #Generate random variates
    pts = np.random.uniform(low=0, high=1, size=(num,2))
    #Convert to sphere space
    pts[:,0] = 2*np.pi*pts[:,0]         #0-360 degrees
    pts[:,1] = np.arccos(2*pts[:,1]-1)  #0-180 degrees
    #Convert to degrees
    pts = np.degrees(pts)
    #Shift ranges to lon-lat
    pts[:,0] -= 180
    pts[:,1] -= 90
    return pts

def ConvertToXYZ(lonlat):
    theta = np.radians(lonlat[:,0])+np.pi
    phi   = np.radians(lonlat[:,1])+np.pi/2
    x     = Rearth*np.cos(theta)*np.sin(phi)
    y     = Rearth*np.sin(theta)*np.sin(phi)
    z     = Rearth*np.cos(phi)
    return np.transpose(np.vstack((x,y,z)))

#For each entry in qpts, find the nearest point in the kdtree
def GetNearestNeighbours(qpts,kdtree):
    pts3d = ConvertToXYZ(qpts)
    #See: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.KDTree.query.html#scipy.spatial.KDTree.query
    #p=2 implies Euclidean distance, eps=0 implies no approximation (slower)
    return kdtree.query(pts3d,p=2,eps=0)

#Generate uniformly-distributed test points on a sphere. Note that you'll want
#to find a way to extract your pandas columns into an array of width=2, height=N
#to match this format.
df1 = GenerateUniformSpherical(10000)
df2 = GenerateUniformSpherical(10000)

#Convert df2 into XYZ coordinates. WARNING! Do not alter df2_3d or kdtree will
#malfunction!
df2_3d = ConvertToXYZ(df2)

#Build a kd-tree from df2_3d
kdtree = sp.spatial.KDTree(df2_3d, leafsize=10) #Stick points in kd-tree for fast look-up

#Return the distance to, and index of, each of df1's nearest neighbour points
distance, indices = GetNearestNeighbours(df1,kdtree)
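To feed your actual dataframes into this instead of the generated test points, you would pull the lon-lat columns out into plain arrays and write the result back, along these lines (a sketch that assumes the lat/lng and Latitude/Longitude column names from your question):

#Column names assumed from the question
qpts   = df1[['lng', 'lat']].to_numpy()
refpts = df2[['Longitude', 'Latitude']].to_numpy()

kdtree = sp.spatial.KDTree(ConvertToXYZ(refpts), leafsize=10)
distance, indices = GetNearestNeighbours(qpts, kdtree)

#These are straight-line (chord) distances through the sphere rather than
#great-circle distances, but the nearest neighbour is the same either way.
df1['school'] = distance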
Upvotes: 4
Reputation: 13707
The key to efficiency with pandas is performing operations on a whole dataframe/series rather than going row by row. So, let's do that.
for index, row1 in tqdm(df1.iterrows()):
    for index, row2 in df2.iterrows():
Here you calculate the Cartesian product of the two dataframes. This can be done much faster like this:
df_product = pd.merge(df1.assign(key=0, index=df1.index),
                      df2.assign(key=0),
                      on='key').drop('key', axis=1)
(Code was taken from here.) I also added a column with the indices of df1; we will need it later to calculate the minimum of the distances for every entity from df1.
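As a side note, on pandas 1.2 or newer the same Cartesian product can be written without the dummy key column (an alternative to, not a replacement for, the merge above):

df_product = df1.assign(index=df1.index).merge(df2, how='cross')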
Now we calculate all the deltas, the latitudes in radians, a, c and the distances in a vectorized manner using NumPy:
R = 6371  # Earth radius in km (was local to distance() in your code)

dLat = np.radians(df_product['Latitude'] - df_product['lat'])
dLong = np.radians(df_product['Longitude'] - df_product['lng'])
lat1 = np.radians(df_product['lat'])
lat2 = np.radians(df_product['Latitude'])

a = (np.sin(dLat / 2) ** 2
     + (np.sin(dLong / 2) ** 2) * np.cos(lat1) * np.cos(lat2))
c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
df_product['d'] = R * c
Now, from df_product we keep only the column with the indices that we added before and the column with the distances. We group the distances by index, calculate the corresponding minimum values and assign them to df1['schools'], as you did in your code.
df1['schools'] = df_product.loc[:, ['index', 'd']].groupby('index', axis=0).min()
That's it. For 1000 rows in each dataframe everything takes less than a second for me.
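If you also want to know which school is nearest rather than only how far away it is, one possible extension (not part of the code above) is to carry df2's index through the merge, e.g. df2.assign(key=0, school_id=df2.index) instead of df2.assign(key=0), and then pick the row with the smallest distance per house:

# Assumes df_product has the extra 'school_id' column described above
nearest = df_product.loc[df_product.groupby('index')['d'].idxmin()].set_index('index')
df1['school_id'] = nearest['school_id']
df1['school_dist'] = nearest['d']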
Upvotes: 2