Improve time complexity of this algorithm

Question

I have N pairs of string lists (N from set 1 to N from set 2) that need to be paired to the closest one through the Jaccard similarity. That means that I need to compute N^2 distances and take for each element in set 1 the maximum similarity w.r.t. set 2.

A simple code to run it would be

import numpy as np

def jaccard_similarity(a, b):
    intersection = set(a).intersection(set(b))
    union = set(a).union(set(b))
    return len(intersection)/len(union)

set_1 = [['Pisa','Tower','River','Tuscany'],['London','City','UK','England'],['Berlin','Germany','Munich']]
set_2 = [['Pisa','Arno','River','Tuscany','Florence','London','Tower'],['Germany','German','UBanh'],['London','City','UK','England','Europe']]

pairs = []

for vect_1 in set_1:
    dist = []
    for vect_2 in set_2:
        dist.append(jaccard_similarity(vect_1,vect_2))
    pairs.append(np.argmax(dist))

print(pairs)

I know this has O(N^2) time complexity, but I was wondering whether there might be some optimization/heuristic so that the average case would be better.

Similarly, is there something concerning the code itself that I might optimize?

EDIT: I modified the question to make it more precise.

user3433489 · Accepted Answer

You should be able to use scipy.spatial.distance.cdist, which computes the entire matrix for a given metric. The time complexity is unavoidable, but scipy makes it fast.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html

Improve time complexity of this algorithm

Answers (1)

Related Questions