Muriel

Reputation: 591

Python: Calculate the similarity between two origin-destination pairs

I'm looking to calculate the similarity between two routes, where a route is defined as an origin-destination pair. I've looked around and couldn't find a similarity measure which takes both direction and length into account, so I invented my own (but to be honest, I'd prefer doing something standard):

SimScore = (l1 + l2) / (l1 + l2 + od + dd), where l1 and l2 are the haversine lengths of the two routes, od is the haversine distance between the two origins, and dd the haversine distance between the two destinations.

Here's the code to calculate the similarity score for this metric:

import mpu

def calculate_similarity(o1_lat, o1_long, d1_lat, d1_long, o2_lat, o2_long, d2_lat, d2_long):
    # route lengths (haversine distance from origin to destination of each route)
    l1 = mpu.haversine_distance((o1_lat, o1_long), (d1_lat, d1_long))
    l2 = mpu.haversine_distance((o2_lat, o2_long), (d2_lat, d2_long))
    # distance between the two origins and between the two destinations
    od = mpu.haversine_distance((o1_lat, o1_long), (o2_lat, o2_long))
    dd = mpu.haversine_distance((d1_lat, d1_long), (d2_lat, d2_long))

    sim = (l1 + l2) / (l1 + l2 + od + dd)
    return sim
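
As a quick sanity check, this is what calling it on two made-up routes looks like (the coordinates below are arbitrary examples, not from my data):

# two roughly parallel routes, e.g. Berlin -> Munich and Potsdam -> Freising (made-up coordinates)
score = calculate_similarity(52.52, 13.40, 48.14, 11.58,
                             52.40, 13.05, 48.35, 11.78)
print(score)  # close to 1 for similar routes, close to 0 for very dissimilar ones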

My main problem now is that this is too slow - I have 200k origin-destination pairs that all need to be compared to each other, and on my current hardware that'll take 45 days.

Does anyone know of a different approach that I could use?

Upvotes: 0

Views: 268

Answers (1)

Michael Hodel

Reputation: 3030

import mpu
import time
from itertools import combinations
import numpy as np
from numpy import sin, cos, sqrt, arctan


# initial function
def calculate_similarity(o1_lat, o1_long, d1_lat, d1_long, o2_lat, o2_long, d2_lat, d2_long):
    l1 = mpu.haversine_distance((o1_lat, o1_long), (d1_lat, d1_long))
    l2 = mpu.haversine_distance((o2_lat, o2_long), (d2_lat, d2_long))
    od = mpu.haversine_distance((o1_lat, o1_long), (o2_lat, o2_long))
    dd = mpu.haversine_distance((d1_lat, d1_long), (d2_lat, d2_long))
    
    sim = (l1+l2)/(l1+l2+od+dd)
    return sim

# get n random origin, destination coordinates
def get_random_coordinates(n_coordinates):
    MIN_LATITUDE = -90
    MAX_LATITUDE = 90
    MIN_LONGITUDE = -180
    MAX_LONGITUDE = 180
    coordinates = []
    for i in range(n_coordinates):
        o_lat = np.random.uniform(MIN_LATITUDE, MAX_LATITUDE)
        o_long = np.random.uniform(MIN_LONGITUDE, MAX_LONGITUDE)
        d_lat = np.random.uniform(MIN_LATITUDE, MAX_LATITUDE)
        d_long = np.random.uniform(MIN_LONGITUDE, MAX_LONGITUDE)
        coordinates.append((o_lat, o_long, d_lat, d_long))
    return coordinates

# compute similarity matrix (iterative approach)
def calculate_similarity_matrix(coordinates, similarity_function):
    n = len(coordinates)
    similarity_matrix = np.diag(np.ones(n))
    for i, j in combinations(range(n), 2):
        o1_lat, o1_long, d1_lat, d1_long = coordinates[i]
        o2_lat, o2_long, d2_lat, d2_long = coordinates[j]
        similarity = similarity_function(
            o1_lat, o1_long, d1_lat, d1_long, o2_lat, o2_long, d2_lat, d2_long
        )
        similarity_matrix[i, j] = similarity
        similarity_matrix[j, i] = similarity
    return similarity_matrix

# faster, vectorized approach
def csm(coordinates):
    n = len(coordinates)
    rc = np.radians(coordinates)                       # convert all degrees to radians once
    c = np.array(list(combinations(np.arange(n), 2)))  # all index pairs (i, j) with i < j
    # haversine "a" term and (unscaled) distance for each route, i.e. the l values
    a = sin((rc[:,2]-rc[:,0])/2)**2+cos(rc[:,0])*cos(rc[:,2])*sin((rc[:,3]-rc[:,1])/2)**2
    d = arctan(sqrt(a/(1-a)))
    l1_p_l2 = d[c[:,0]]+d[c[:,1]]                      # l1 + l2 for every pair
    # haversine "a" terms for origin-origin (od) and destination-destination (dd) distances
    aod = sin((rc[c[:,1],0]-rc[c[:,0],0])/2)**2+cos(rc[c[:,0],0])*cos(rc[c[:,1],0])*sin((rc[c[:,1],1]-rc[c[:,0],1])/2)**2
    add = sin((rc[c[:,1],2]-rc[c[:,0],2])/2)**2+cos(rc[c[:,0],2])*cos(rc[c[:,1],2])*sin((rc[c[:,1],3]-rc[c[:,0],3])/2)**2
    # fill the upper triangle with the similarities; the diagonal holds 1/2 so that tri + tri.T has ones on it
    tri = np.diag(np.full(n, 1/2))
    tri[np.triu_indices(n,1)] = l1_p_l2/(l1_p_l2+arctan(sqrt(aod/(1-aod)))+arctan(sqrt(add/(1-add))))
    return tri + tri.T

# simulate coordinates
n_coordinates = 1000
np.random.seed(69)
coordinates = get_random_coordinates(n_coordinates)

# number of pairs
target = 200000

# time provided function
start = time.time()
similarity_matrix = calculate_similarity_matrix(coordinates, calculate_similarity)
end = time.time()
runtime = end - start

# time faster approach
start2 = time.time()
similarity_matrix2 = csm(coordinates)
end2 = time.time()
runtime2 = end2 - start2

# ensure correctness
assert np.max(np.abs(similarity_matrix - similarity_matrix2)) < 1e-8

# calculate estimated speedup & runtime
speedup = runtime / runtime2
n_sim = n_coordinates * (n_coordinates - 1) / 2
n_target = target * (target - 1) / 2
ratio = n_target / n_sim
runtime = runtime2 * ratio / 3600

print(f'Estimated speedup: {speedup:.2f}x')
print(f'Estimated runtime: {runtime:.2f} hours')

prints

Estimated speedup: 31.31x
Estimated runtime: 4.70 hours

This code employs a range of tricks to speed up the computation; above all, it vectorizes the calculation of the similarity measures with numpy. Some of those tricks are:

  • Avoid unnecessary recomputation of the route distances: if calculate_similarity were called on each pair out of a total of n routes, each route distance would be computed n-1 times. By precomputing each route distance only once, one saves n-2 distance computations per route.
  • Avoid unnecessarily converting the distances to kilometers by dropping the multiplication with the earth's diameter. This works because of the way SimScore is defined: multiplying each of the additive components l1, l2, od, dd by the same constant does not affect the ratio (l1 + l2) / (l1 + l2 + od + dd).
  • Avoid unnecessary conversions of latitude and longitude degrees to radians by converting them only once up front.
  • Avoid performing checks for coordinate validity. Note that this assumes all coordinates are already within the valid range.
  • Simplify the math to save on expensive operations, e.g. rewriting atan2(sqrt(a), sqrt(1 - a)) as atan(sqrt(a / (1 - a))) (saving a square root) or sin(x) * sin(x) as sin(x) ** 2 (saving a sine evaluation); both this rewrite and the scale-invariance argument above are verified numerically in the short check after this list.
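
A minimal numerical check of the scale-invariance claim and the atan2 rewrite (the values used are arbitrary examples, only meant to illustrate the identities):

import numpy as np

# multiplying l1, l2, od, dd by a common constant leaves SimScore unchanged
l1, l2, od, dd = 3.2, 4.1, 0.7, 1.3   # arbitrary example values
k = 2 * 6371.0                        # e.g. the earth's diameter in km
assert np.isclose((l1 + l2) / (l1 + l2 + od + dd),
                  (k * l1 + k * l2) / (k * l1 + k * l2 + k * od + k * dd))

# atan2(sqrt(a), sqrt(1 - a)) equals atan(sqrt(a / (1 - a))) for a in [0, 1)
a = np.random.uniform(0, 1, 1000)
assert np.allclose(np.arctan2(np.sqrt(a), np.sqrt(1 - a)),
                   np.arctan(np.sqrt(a / (1 - a))))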

To estimate the speedup and runtime, n=1000 random routes were simulated and both approaches were timed. To verify correctness, the code asserts that the infinity norm of the difference between the two similarity matrices is essentially zero. Note that these estimates should only serve as rough proxies: the chosen timing method is not very robust, things will likely look a bit different for larger n (the estimated runtime is simply a linear extrapolation), and these numbers were obtained on Google Colab, so they will differ on other hardware.
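
If more robust timings are desired, timeit.repeat could be used instead of single time.time() measurements, e.g. (a sketch, not part of the measurements above):

import timeit

# best of three single runs, to reduce noise from other processes
t_iter = min(timeit.repeat(lambda: calculate_similarity_matrix(coordinates, calculate_similarity),
                           number=1, repeat=3))
t_vec = min(timeit.repeat(lambda: csm(coordinates), number=1, repeat=3))
print(f'Speedup: {t_iter / t_vec:.2f}x')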

I believe there are still optimizations that could be made. In particular: if highly accurate similarity measures are not required for the use case at hand, and the routes (i.e. l1, l2) and/or the distances between start- and endpoints (i.e. od, dd) are small enough, then a cheaper-to-compute distance metric may approximate the haversine distance well enough and could be used where applicable.
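
For illustration, one such cheaper metric is the equirectangular approximation, which replaces the haversine trigonometry with a flat projection and is reasonable for short distances; whether it is accurate enough depends on the data. A sketch (the names equirect_dist and calculate_similarity_approx are made up for this example):

import numpy as np

def equirect_dist(lat1, lon1, lat2, lon2):
    # equirectangular approximation: cheap to compute, accurate for short distances
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    x = (lon2 - lon1) * np.cos((lat1 + lat2) / 2)
    y = lat2 - lat1
    return np.sqrt(x * x + y * y)  # left unscaled, which is fine since SimScore is scale-invariant

def calculate_similarity_approx(o1_lat, o1_long, d1_lat, d1_long, o2_lat, o2_long, d2_lat, d2_long):
    l1 = equirect_dist(o1_lat, o1_long, d1_lat, d1_long)
    l2 = equirect_dist(o2_lat, o2_long, d2_lat, d2_long)
    od = equirect_dist(o1_lat, o1_long, o2_lat, o2_long)
    dd = equirect_dist(d1_lat, d1_long, d2_lat, d2_long)
    return (l1 + l2) / (l1 + l2 + od + dd)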

However, as 200000 * (200000 - 1) / 2 = 19'999'900'000 similarity measures are to be computed, this will inevitably take a while.

Upvotes: 1
