Reputation: 591
I'm looking to calculate the similarity between two routes, where a route is defined as an origin-destination pair. I've looked around and couldn't find a similarity measure that takes both direction and length into account, so I invented my own (though, to be honest, I'd prefer something standard):

SimScore = (l1 + l2) / (l1 + l2 + OD + DD)

where l1 and l2 are the haversine lengths of the two routes, OD is the haversine distance between the two origins, and DD is the haversine distance between the two destinations.
Here's the code to calculate the similarity score for this metric:
import mpu

def calculate_similarity(o1_lat, o1_long, d1_lat, d1_long, o2_lat, o2_long, d2_lat, d2_long):
    l1 = mpu.haversine_distance((o1_lat, o1_long), (d1_lat, d1_long))  # length of route 1
    l2 = mpu.haversine_distance((o2_lat, o2_long), (d2_lat, d2_long))  # length of route 2
    od = mpu.haversine_distance((o1_lat, o1_long), (o2_lat, o2_long))  # distance between origins
    dd = mpu.haversine_distance((d1_lat, d1_long), (d2_lat, d2_long))  # distance between destinations
    sim = (l1 + l2) / (l1 + l2 + od + dd)
    return sim
My main problem now is that this is too slow - I have 200k origin-destination pairs that all need to be compared to each other, and on my current compute that'll take 45 days.
Does anyone know of a different approach that I could use?
Upvotes: 0
Views: 268
Reputation: 3030
import mpu
import time
from itertools import combinations
import numpy as np
from numpy import sin, cos, sqrt, arctan
# initial function
def calculate_similarity(o1_lat, o1_long, d1_lat, d1_long, o2_lat, o2_long, d2_lat, d2_long):
    l1 = mpu.haversine_distance((o1_lat, o1_long), (d1_lat, d1_long))
    l2 = mpu.haversine_distance((o2_lat, o2_long), (d2_lat, d2_long))
    od = mpu.haversine_distance((o1_lat, o1_long), (o2_lat, o2_long))
    dd = mpu.haversine_distance((d1_lat, d1_long), (d2_lat, d2_long))
    sim = (l1 + l2) / (l1 + l2 + od + dd)
    return sim
# get n random origin, destination coordinates
def get_random_coordinates(n_coordinates):
    MIN_LATITUDE = -90
    MAX_LATITUDE = 90
    MIN_LONGITUDE = -180
    MAX_LONGITUDE = 180
    coordinates = []
    for i in range(n_coordinates):
        o_lat = np.random.uniform(MIN_LATITUDE, MAX_LATITUDE)
        o_long = np.random.uniform(MIN_LONGITUDE, MAX_LONGITUDE)
        d_lat = np.random.uniform(MIN_LATITUDE, MAX_LATITUDE)
        d_long = np.random.uniform(MIN_LONGITUDE, MAX_LONGITUDE)
        coordinates.append((o_lat, o_long, d_lat, d_long))
    return coordinates
# compute similarity matrix (iterative approach)
def calculate_similarity_matrix(coordinates, similarity_function):
    n = len(coordinates)
    similarity_matrix = np.diag(np.ones(n))
    for i, j in combinations(range(n), 2):
        o1_lat, o1_long, d1_lat, d1_long = coordinates[i]
        o2_lat, o2_long, d2_lat, d2_long = coordinates[j]
        similarity = similarity_function(
            o1_lat, o1_long, d1_lat, d1_long, o2_lat, o2_long, d2_lat, d2_long
        )
        similarity_matrix[i, j] = similarity
        similarity_matrix[j, i] = similarity
    return similarity_matrix
# faster approach
def csm(coordinates):
    n = len(coordinates)
    rc = np.radians(coordinates)  # (n, 4) array: o_lat, o_long, d_lat, d_long in radians
    c = np.array(list(combinations(np.arange(n), 2)))  # all pair indices, shape (n*(n-1)/2, 2)
    # haversine term for each route's own origin-destination distance
    a = sin((rc[:,2]-rc[:,0])/2)**2 + cos(rc[:,0])*cos(rc[:,2])*sin((rc[:,3]-rc[:,1])/2)**2
    d = arctan(sqrt(a/(1-a)))  # route lengths, up to the constant factor 2 * earth radius
    l1_p_l2 = d[c[:,0]] + d[c[:,1]]  # l1 + l2 for every pair
    # haversine terms for origin-origin and destination-destination distances per pair
    aod = sin((rc[c[:,1],0]-rc[c[:,0],0])/2)**2 + cos(rc[c[:,0],0])*cos(rc[c[:,1],0])*sin((rc[c[:,1],1]-rc[c[:,0],1])/2)**2
    add = sin((rc[c[:,1],2]-rc[c[:,0],2])/2)**2 + cos(rc[c[:,0],2])*cos(rc[c[:,1],2])*sin((rc[c[:,1],3]-rc[c[:,0],3])/2)**2
    tri = np.diag(np.full(n, 1/2))  # diagonal becomes 1 after adding the transpose
    tri[np.triu_indices(n,1)] = l1_p_l2 / (l1_p_l2 + arctan(sqrt(aod/(1-aod))) + arctan(sqrt(add/(1-add))))
    return tri + tri.T
# simulate coordinates
n_coordinates = 1000
np.random.seed(69)
coordinates = get_random_coordinates(n_coordinates)
# number of pairs
target = 200000
# time provided function
start = time.time()
similarity_matrix = calculate_similarity_matrix(coordinates, calculate_similarity)
end = time.time()
runtime = end - start
# time faster approach
start2 = time.time()
similarity_matrix2 = csm(coordinates)
end2 = time.time()
runtime2 = end2 - start2
# ensure correctness
assert np.max(np.abs(similarity_matrix - similarity_matrix2)) < 1e-8
# calculate estimated speedup & runtime
speedup = runtime / runtime2
n_sim = n_coordinates * (n_coordinates - 1) / 2
n_target = target * (target - 1) / 2
ratio = n_target / n_sim
runtime = runtime2 * ratio / 3600
print(f'Estimated speedup: {speedup:.2f}x')
print(f'Estimated runtime: {runtime:.2f} hours')
prints
Estimated speedup: 31.31x
Estimated runtime: 4.70 hours
This code employs a wide range of tricks to speed up the runtime; above all, it vectorizes the computation of the similarity measures using numpy. Some of those tricks are:

- Avoiding redundant distance computations: if calculate_similarity were to be called on each pair of routes out of a total of n routes, then each route distance would be computed n-1 times. By effectively precomputing the distances only once, one can thus save n-2 distance computations per route.
- Dropping the constant factor (twice the Earth's radius) that mpu.haversine_distance multiplies in, which is possible due to how SimScore is defined, i.e. due to the fact that multiplying each additive component l1, l2, od, dd by a constant will not affect the ratio (l1 + l2) / (l1 + l2 + od + dd).
- Computing atan2(sqrt(a), sqrt(1 - a)) as atan(sqrt(a / (1 - a))) (saving square root operations) or sin(x) * sin(x) as sin(x) ** 2 (saving sine operations).

To estimate the speedup and runtime, n=1000 random routes were simulated and both approaches were timed. To verify correctness, the code asserts that the infinity norm of the difference between the two similarity matrices is essentially zero. Note that these estimates should only serve as rough proxies: the chosen timing method is not very robust, things will likely look a bit different for larger n (the estimated runtime is simply a linear extrapolation in the number of pairs), and these numbers were achieved on a GPU via Google Colab, which additionally affects the estimation bias.
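As a quick sanity check on two of these simplifications, the following minimal sketch (not part of the timed code above) verifies with numpy that atan2(sqrt(a), sqrt(1 - a)) equals atan(sqrt(a / (1 - a))) and that multiplying all four distance components by the same constant leaves SimScore unchanged:

import numpy as np

# a-values produced by the haversine formula always lie in [0, 1)
a = np.random.uniform(0.0, 0.99, size=1000)
assert np.allclose(np.arctan2(np.sqrt(a), np.sqrt(1 - a)),
                   np.arctan(np.sqrt(a / (1 - a))))

# scaling l1, l2, od, dd by the constant 2 * R does not change the ratio
R = 6371.0  # Earth radius in km, the factor dropped inside csm
l1, l2, od, dd = np.random.uniform(0.1, 1.0, size=(4, 1000))
assert np.allclose((2 * R * (l1 + l2)) / (2 * R * (l1 + l2 + od + dd)),
                   (l1 + l2) / (l1 + l2 + od + dd))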
I believe there are still optimizations that could be made, especially: if highly accurate similarity measures are not required for the use case at hand and the routes (i.e. l1, l2) and/or the distances between start and end points (i.e. od, dd) are small enough, then a cheaper-to-compute distance metric may approximate the haversine distance well enough and could thus be employed if/where applicable.
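For illustration, one such cheaper metric (my assumption, not something used in the code above) is the equirectangular approximation, which needs only one cosine per pair of points and stays close to the haversine distance as long as the two points are near each other:

import numpy as np

def equirectangular_distance(lat1, lon1, lat2, lon2, radius=6371.0):
    # approximate great-circle distance in km; inputs in degrees,
    # reasonable only when the two points are close together
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    x = (lon2 - lon1) * np.cos((lat1 + lat2) / 2)
    y = lat2 - lat1
    return radius * np.sqrt(x * x + y * y)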
However, as 200000 * (200000 - 1) / 2 = 19'999'900'000
similarity measures are to be computed, this will inevitably take a while.
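One practical note (my addition, not from the answer above): a full 200000 x 200000 float64 similarity matrix alone would occupy roughly 320 GB, so for the real data the pairs would likely have to be produced and consumed in chunks rather than materialized at once. A rough, hypothetical sketch of that idea, reusing the precomputed per-route lengths d from csm:

import numpy as np
from itertools import combinations, islice
from numpy import sin, cos, sqrt, arctan

def similarity_chunks(rc, d, chunk_size=5_000_000):
    # rc: (n, 4) route coordinates in radians; d: per-route lengths as computed in csm
    n = len(rc)
    pair_iter = combinations(range(n), 2)
    while True:
        pairs = np.array(list(islice(pair_iter, chunk_size)))
        if pairs.size == 0:
            break
        i, j = pairs[:, 0], pairs[:, 1]
        aod = sin((rc[j,0]-rc[i,0])/2)**2 + cos(rc[i,0])*cos(rc[j,0])*sin((rc[j,1]-rc[i,1])/2)**2
        add = sin((rc[j,2]-rc[i,2])/2)**2 + cos(rc[i,2])*cos(rc[j,2])*sin((rc[j,3]-rc[i,3])/2)**2
        l = d[i] + d[j]
        yield pairs, l / (l + arctan(sqrt(aod/(1-aod))) + arctan(sqrt(add/(1-add))))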
Upvotes: 1