Python - inefficient spatial distance calculation (how can it be speed up)

Question

I am currently trying some geocoding in Python. The process is the following: I have two data frames (df1 and df2, houses and schools) with latitude and longitude values and want to find the nearest neighbour in df2 for every observations in df1. I use the following code:

from tqdm import tqdm
import numpy as np
import pandas as pd
import math 

def distance(lat1, long1, lat2, long2):
        R = 6371 # Earth Radius in Km
        dLat = math.radians(lat2 - lat1) # Convert Degrees 2 Radians 
        dLong = math.radians(long2 - long1)
        lat1 = math.radians(lat1)
        lat2 = math.radians(lat2)
        a = math.sin(dLat/2) * math.sin(dLat/2) + math.sin(dLong/2) * math.sin(dLong/2) * math.cos(lat1) * math.cos(lat2)
        c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
        d = R * c
        return d

dists = []
schools =[]
for index, row1 in tqdm(df1.iterrows()):
    for index, row2 in df2.iterrows():
        dists.append(distance(row1.lat, row1.lng, row2.Latitude, row2.Longitude))
    schools.append(min(dists))
    del dists [:]

df1["school"] = pd.Series(schools)

The code works, however it takes ages. With tqdm I get an average speed of 2 iterations of df1 per second. As a comparison, I did the whole task in STATA with geonear and it takes 1 second for all observations in df1 (950). I read in the helpfile of geonear that they use clustering, to not calculate all distances, but only the closest. However, before I add a clustering function (which might also take CPU power), I wonder if someone sees some way to speed up the process as it is (I am new to python and might have some inefficient code that slows the process down). Or is there maybe a package that does the process faster?

I would be ok if it takes longer than in STATA, but not nearly 7 minutes...

Thank you in advance

Richard · Accepted Answer

The way you're doing this is slow because you are using an O(n²) algorithm: each row looks at every other row. Georgy's answer, while introducing vectorization, does not solve this fundamental inefficiency.

I'd recommend loading your data points into a kd-tree: this data structure provides a fast way of finding nearest neighbours in multiple dimensions. Construction of such a tree is in O(n log n) and a query takes O(log n), so total time is in O(n log n).

If your data is localized to a geographic region that can be well-approximated by a plane, project your data and then perform the lookup in two dimensions. Otherwise, if your data is globally dispersed, project into spherical cartesian coordinates and perform the look-up there.

An example of how you might do this appears as follows:

#/usr/bin/env python3

import numpy as np
import scipy as sp
import scipy.spatial

Rearth = 6371

#Generate uniformly-distributed lon-lat points on a sphere
#See: http://mathworld.wolfram.com/SpherePointPicking.html
def GenerateUniformSpherical(num):
  #Generate random variates
  pts      = np.random.uniform(low=0, high=1, size=(num,2))
  #Convert to sphere space
  pts[:,0] = 2*np.pi*pts[:,0]          #0-360 degrees
  pts[:,1] = np.arccos(2*pts[:,1]-1)   #0-180 degrees
  #Convert to degrees
  pts = np.degrees(pts)
  #Shift ranges to lon-lat
  pts[:,0] -= 180
  pts[:,1] -= 90
  return pts

def ConvertToXYZ(lonlat):
  theta  = np.radians(lonlat[:,0])+np.pi
  phi    = np.radians(lonlat[:,1])+np.pi/2
  x      = Rearth*np.cos(theta)*np.sin(phi)
  y      = Rearth*np.sin(theta)*np.sin(phi)
  z      = Rearth*np.cos(phi)
  return np.transpose(np.vstack((x,y,z)))

#For each entry in qpts, find the nearest point in the kdtree
def GetNearestNeighbours(qpts,kdtree):
  pts3d        = ConvertToXYZ(qpts)
  #See: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.KDTree.query.html#scipy.spatial.KDTree.query
  #p=2 implies Euclidean distance, eps=0 implies no approximation (slower)
  return kdtree.query(pts3d,p=2,eps=0) 

#Generate uniformly-distributed test points on a sphere. Note that you'll want
#to find a way to extract your pandas columns into an array of width=2, height=N
#to match this format.
df1 = GenerateUniformSpherical(10000)
df2 = GenerateUniformSpherical(10000)

#Convert df2 into XYZ coordinates. WARNING! Do not alter df2_3d or kdtree will
#malfunction!
df2_3d = ConvertToXYZ(df2)
#Build a kd-tree from df2_3D
kdtree = sp.spatial.KDTree(df2_3d, leafsize=10) #Stick points in kd-tree for fast look-up

#Return the distance to, and index of, each of df1's nearest neighbour points
distance, indices = GetNearestNeighbours(df1,kdtree)

Python - inefficient spatial distance calculation (how can it be speed up)

Answers (2)

Related Questions