Reputation: 190
I have N pairs of string lists (N from set 1 to N from set 2) that need to be paired to the closest one through the Jaccard similarity. That means that I need to compute N^2 distances and take for each element in set 1 the maximum similarity w.r.t. set 2.
A simple code to run it would be
import numpy as np
def jaccard_similarity(a, b):
intersection = set(a).intersection(set(b))
union = set(a).union(set(b))
return len(intersection)/len(union)
set_1 = [['Pisa','Tower','River','Tuscany'],['London','City','UK','England'],['Berlin','Germany','Munich']]
set_2 = [['Pisa','Arno','River','Tuscany','Florence','London','Tower'],['Germany','German','UBanh'],['London','City','UK','England','Europe']]
pairs = []
for vect_1 in set_1:
dist = []
for vect_2 in set_2:
dist.append(jaccard_similarity(vect_1,vect_2))
pairs.append(np.argmax(dist))
print(pairs)
I know this has O(N^2) time complexity, but I was wondering whether there might be some optimization/heuristic so that the average case would be better.
Similarly, is there something concerning the code itself that I might optimize?
EDIT: I modified the question to make it more precise.
Upvotes: 2
Views: 272
Reputation: 989
You should be able to use scipy.spatial.distance.cdist, which computes the entire matrix for a given metric. The time complexity is unavoidable, but scipy makes it fast.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html
Upvotes: 2