Reputation: 171
I have a set of vectors (~30k), each of which consists of 300 elements generated by fasttext, each vector is representing the meaning of an entity, I want to calculate the similarity between all entities, so I iterate over the vectors in a nested matter, O(N^2) complexity, which is not practical in terms of time.
Can you recommend me another approach for calculating this, or how can I parallelize it?
def calculate_similarity(v1, v2):
"""
Calculate cosine distance between two vectors
"""
n1 = np.linalg.norm(v1)
n2 = np.linalg.norm(v2)
return np.dot(v1, v2) / n1 / n2
similarities = {}
for ith_entity, ith_vector in vectors.items():
for jth_entity, jth_vector in vectors.items():
if ith_entity == jth_entity:
continue
if (ith_entity, jth_entity) in similarities.keys() or (jth_entity, ith_entity) in similarities.keys():
continue
similarities[(ith_entity, jth_entity)] = calculate_similarity(ith_vector, jth_vector)
Upvotes: 4
Views: 7042
Reputation: 13
Use a ball tree, I have used it on a very large feature vector with shape (16460,4096). Firstly construct a tree using the chunk below
from sklearn.neighbors import BallTree
tree = BallTree(features_tsvd, metric = spatial.distance.cosine)
Now to search a query in the tree try something like this:
dists, ind = tree.query(query, k=10)
Upvotes: 0
Reputation: 11643
I had a go at vectorizing it.
import numpy as np
from itertools import combinations
np.random.seed(1)
vector_data = np.random.randn(3, 3)
v1, v2, v3 = vector_data[0], vector_data[1], vector_data[2]
def similarities_vectorized(vector_data):
norms = np.linalg.norm(vector_data, axis=1)
combs = np.stack(combinations(range(vector_data.shape[0]),2))
similarities = (vector_data[combs[:,0]]*vector_data[combs[:,1]]).sum(axis=1)/norms[combs][:,0]/norms[combs][:,1]
return combs, similarities
combs, similarities = similarities_vectorized(vector_data)
for comb, similarity in zip(combs, similarities):
print(comb, similarity)
Output:
[0 1] -0.217095007411
[0 2] 0.894174618451
[1 2] -0.630555641519
Compare result with code from Question:
def calculate_similarity(v1, v2):
"""
Calculate cosine distance between two vectors
"""
n1 = np.linalg.norm(v1)
n2 = np.linalg.norm(v2)
return np.dot(v1, v2) / n1 / n2
def calculate_simularities(vectors):
similarities = {}
for ith_entity, ith_vector in vectors.items():
for jth_entity, jth_vector in vectors.items():
if ith_entity == jth_entity:
continue
if (ith_entity, jth_entity) in similarities.keys() or (jth_entity, ith_entity) in similarities.keys():
continue
similarities[(ith_entity, jth_entity)] = calculate_similarity(ith_vector, jth_vector)
return similarities
vectors = {'A': v1, 'B': v2, 'C': v3}
print(calculate_simularities(vectors))
Output:
{('A', 'B'): -0.21709500741113338, ('A', 'C'): 0.89417461845058566, ('B', 'C'): -0.63055564151883581}
The vectorized version was about 3.3 times faster when I ran it on a set of 300 vectors.
UPDATE:
This version is about 50 times faster than the original:
def similarities_vectorized2(vector_data):
norms = np.linalg.norm(vector_data, axis=1)
combs = np.fromiter(combinations(range(vector_data.shape[0]),2), dtype='i,i')
similarities = (vector_data[combs['f0']]*vector_data[combs['f1']]).sum(axis=1)/norms[combs['f0']]/norms[combs['f1']]
return combs, similarities
combs, similarities = similarities_vectorized2(vector_data)
for comb, similarity in zip(combs, similarities):
print(comb, similarity)
Output:
(0, 1) -0.217095007411
(0, 2) 0.894174618451
(1, 2) -0.630555641519
Upvotes: 1
Reputation: 5486
You could get rid of the nested loop, which is slow, by using scipy
's distance module.
Given vectors = {'k1':v1, 'k2':v2, ..., 'km':vm}
with vi
being a Python List of length n.
import numpy as np
from scipy.spatial import distance
# transfrom vectors to m x n numpy array
data = np.array(list(vectors.values())
# compute pairwise cosine distance
pws = distance.pdist(data, metric='cosine')
pws
is condensed distance matrix. It is one-dimensional and holds the distances in the following order:
pws = np.array([ (k1, k2), (k1, k3), (k1, k4), ..., (k1, km),
(k2, k3), (k2, k4), ..., (k2, km),
...,
(km-1, km) ])
Note also that distance.pdist
calculates the cosine distance rather than cosine similarity.
Upvotes: 3