Reputation: 20409
Currently, I use the following code to find the stops closest to a given geopoint:
from math import cos, asin, sqrt

def get_closest_stops(lat, lng, data=db.STOPS):
    def distance(lat1, lon1, lat2, lon2):
        p = 0.017453292519943295
        a = 0.5 - cos((lat2-lat1)*p)/2 + cos(lat1*p)*cos(lat2*p) * (1-cos((lon2-lon1)*p)) / 2
        return 12742 * asin(sqrt(a))
    v = {'lat': lat, 'lng': lng}
    return sorted(data.values(), key=lambda p: distance(v['lat'], v['lng'], p['lat'], p['lng']))
and here is db.STOPS:
STOPS = {
    '1282': {'lat': 59.773368, 'lng': 30.125112, 'image_osm': '1652229/464dae0763a8b1d5495e', 'name': 'name 1', 'descr': ''},
    '1285': {'lat': 59.941117, 'lng': 30.271756, 'image_osm': '965417/63af068831d93dac9830', 'name': 'name 2', 'descr': ''},
    ...
}
The dict contains about 7,000 records and the search is slow. Is there any way to speed it up? I only need the 5 closest results, and I can re-sort the dict or even create a copy of it if needed.
Upvotes: 1
Views: 98
Reputation: 99
Use multithreading and split the records across the available threads.
Hope this helps: Tutorials Point page about multithreading
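A minimal sketch of that idea, assuming the distance function and STOPS dict from the question are available at module level; the function name, chunk count and pool size here are illustrative, and note that for CPU-bound pure-Python math the GIL may limit the actual speedup, so measure before relying on it:

from concurrent.futures import ThreadPoolExecutor
import heapq

def get_closest_stops_threaded(lat, lng, data=STOPS, n=5, workers=4):
    # split the records into one chunk per worker
    stops = list(data.values())
    chunk_size = (len(stops) + workers - 1) // workers
    chunks = [stops[i:i + chunk_size] for i in range(0, len(stops), chunk_size)]

    def closest_in_chunk(chunk):
        # n closest stops within a single chunk
        return heapq.nsmallest(n, chunk, key=lambda p: distance(lat, lng, p['lat'], p['lng']))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(closest_in_chunk, chunks))

    # merge the per-chunk winners and keep the overall n closest
    candidates = [stop for part in partial for stop in part]
    return heapq.nsmallest(n, candidates, key=lambda p: distance(lat, lng, p['lat'], p['lng']))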
Upvotes: 0
Reputation: 15738
TLDR: numpy gives a 10x speedup, partitioning the distances adds another 3x, ~30-50x total. Using Manhattan distance as a cheaper approximation to eliminate infeasible candidates has about the same effect, but is less intuitive.
Here is a colab with full code for my experiments, results below:
# Setup:
import numpy as np
DTYPE = np.float64
STOPS_NP = np.array([(stop['lat'], stop['lng']) for stop in STOPS.values()], dtype=[('lat', DTYPE), ('lng', DTYPE)])
Naive implementation: 8ms per loop
from math import cos, asin, sqrt

def distance(lat1, lon1, lat2, lon2):
    p = 0.017453292519943295
    a = 0.5 - cos((lat2-lat1)*p)/2 + cos(lat1*p)*cos(lat2*p) * (1-cos((lon2-lon1)*p)) / 2
    return 12742 * asin(sqrt(a))

def get_closest_stops(lat, lng, data=STOPS):
    v = {'lat': lat, 'lng': lng}
    return sorted(data.values(), key=lambda p: distance(v['lat'], v['lng'], p['lat'], p['lng']))
Same algorithm ported to numpy: 1.3ms per loop using float64, 500us using float32 (13x improvement!):
def distance_np(lat1, lon1, lat2, lon2):
    # numpy version of distance
    p = 0.017453292519943295
    a = 0.5 - np.cos((lat2-lat1)*p)/2 + np.cos(lat1*p)*np.cos(lat2*p) * (1-np.cos((lon2-lon1)*p)) / 2
    # multiplication by a constant does not affect sorting
    return 12742 * np.arcsin(np.sqrt(a))

def get_closest_stops_np(lat, lng, data=STOPS_NP):
    distances = distance_np(lat, lng, data['lat'], data['lng'])
    indexes = distances.argsort()
    return data[indexes]
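The structured array only carries coordinates, so if you need the full stop records back you have to map the returned indexes to the original dict keys. A minimal sketch, assuming STOPS_NP was built from STOPS in insertion order as in the setup above (STOP_IDS and get_closest_stop_records are illustrative names):

# keep the dict keys in the same order used to build STOPS_NP,
# so an index into the array maps back to a stop id
STOP_IDS = list(STOPS.keys())

def get_closest_stop_records(lat, lng, n=5):
    distances = distance_np(lat, lng, STOPS_NP['lat'], STOPS_NP['lng'])
    indexes = distances.argsort()[:n]
    return [STOPS[STOP_IDS[i]] for i in indexes]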
Using partial array sort to get top 5 candidates - 500us float64, 160us float32 (50x improvement!):
def get_closest_stops_np_argpartition(lat, lng, data=STOPS_NP, n=5):
    distances = distance_np(lat, lng, data['lat'], data['lng'])
    indexes = np.argpartition(distances, n)[:n]
    return data[indexes]
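Note that np.argpartition only guarantees that the n closest stops land in the first n slots, not that those n are in distance order. If the 5 results also need to be ordered, sorting just the shortlist afterwards is cheap. A minimal sketch (the function name is illustrative):

def get_closest_stops_np_argpartition_sorted(lat, lng, data=STOPS_NP, n=5):
    distances = distance_np(lat, lng, data['lat'], data['lng'])
    shortlist = np.argpartition(distances, n)[:n]          # n closest, unordered
    ordered = shortlist[np.argsort(distances[shortlist])]  # order only the shortlist
    return data[ordered]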
Using a cheaper distance approximation to avoid calculating the more expensive exact distance - what I initially proposed in the comments: 600us
def get_closest_stops_np_early_stop(lat, lng, data=STOPS_NP, n=5):
    # shortcut: use a cheaper approximate distance as an upper bound,
    # create a much smaller shortlist, then sort by the real (expensive) distance
    distances = manhattan_distance_np(lat, lng, data['lat'], data['lng'])
    indexes = distances.argsort()
    nth_largest_manhattan_distance = distances[indexes[n-1]]
    # manhattan distance is at most 1.5x higher than the real distance
    # on a plane; perhaps should be adjusted for the globe
    max_manhattan_for_equivalent_distance = nth_largest_manhattan_distance * 1.5
    n_feasible_candidates = np.searchsorted(distances, max_manhattan_for_equivalent_distance, sorter=indexes)
    data_shortlist = data[indexes[:n_feasible_candidates]]
    # the rest is the same as get_closest_stops_np
    distances = distance_np(lat, lng, data_shortlist['lat'], data_shortlist['lng'])
    indexes = distances.argsort()
    return data_shortlist[indexes[:n]]
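manhattan_distance_np is defined in the linked colab; a minimal sketch of such a helper, assuming a plain degree-space Manhattan distance (absolute latitude difference plus absolute longitude difference):

def manhattan_distance_np(lat1, lon1, lat2, lon2):
    # cheap proxy for distance: sum of absolute coordinate differences in degrees;
    # only used to rank and prune candidates, so its units need not match distance_np
    return np.abs(lat2 - lat1) + np.abs(lon2 - lon1)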
All four algorithms produce the same result on the simulated data. Note that a degree of longitude on the globe typically covers a smaller distance than a degree of latitude, so the horizontal scale on the simulated plot doesn't match the vertical scale.
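A quick way to check that claim on your own data, assuming the functions above are defined; the query coordinates are illustrative:

# illustrative correctness check: all variants should return the same 5 nearest stops
query_lat, query_lng = 59.93, 30.31

top5_sorted = get_closest_stops_np(query_lat, query_lng)[:5]
top5_partition = get_closest_stops_np_argpartition(query_lat, query_lng)
top5_early_stop = get_closest_stops_np_early_stop(query_lat, query_lng)

# argpartition returns the top 5 unordered, so compare as sorted lists of tuples
assert sorted(top5_partition.tolist()) == sorted(top5_sorted.tolist())
assert sorted(top5_early_stop.tolist()) == sorted(top5_sorted.tolist())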
Upvotes: 1