Reputation: 5392
I have two large sets of vectors, A and B. Each element of A is a 1-dimensional vector of length 400, with float values between -10 and 10. For each vector in A, I'm trying to calculate the cosine similarities to all vectors in B, in order to find the top 5 vectors in B that best match the given A vector. For now I'm looping over all of A, and looping over all of B, calculating the cosine similarities one by one with SciPy's spatial.distance.cosine(a, b). Is there a faster way to do this? Perhaps with matrices?
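For concreteness, here's a minimal sketch of what I'm doing now (assuming A and B are sequences of 400-element NumPy arrays):

from scipy import spatial

# current approach: two nested Python loops,
# one SciPy call per (a, b) pair
for a in A:
    for b in B:
        sim = 1 - spatial.distance.cosine(a, b)  # cosine similarity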
Upvotes: 9
Views: 9105
Reputation: 1
You can use "fast approximate nearest neighbor search" instead, which is faster but less accurate.
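One such library is Spotify's annoy; here's a minimal sketch, assuming annoy is installed (its 'angular' metric ranks neighbours by cosine similarity):

from annoy import AnnoyIndex

index = AnnoyIndex(400, 'angular')  # 'angular' ranks by cosine similarity
for i, b in enumerate(B):
    index.add_item(i, b)
index.build(10)  # 10 trees; more trees -> better accuracy, slower build

for a in A:
    top5 = index.get_nns_by_vector(a, 5)  # approximate 5 nearest B indices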
Upvotes: 0
Reputation: 25023
This is a NAIVE, no-loop, no-overhead(?) implementation of what you need...
import numpy as np
from numpy.linalg import norm

# normalise the rows of A and B, then take all pairwise dot products
res = 1 - np.dot(A/norm(A, axis=1)[..., None], (B/norm(B, axis=1)[..., None]).T)
Could you please benchmark it on a subset of your data and let us know if it's faster than scipy's cosine distance?
P.S.: axis=1 above is based on the assumption that your vectors are stored row-wise,

print A
# [[1 2 3 4 5 6 7 8 ... 400]
#  [2 3 4 5 6 7 8 9 ... 401]
#  etc.
In [79]: A = np.random.random((2,5))
In [80]: A
Out[80]:
array([[ 0.2917865 , 0.89617367, 0.27118045, 0.58596817, 0.05154168],
[ 0.61131638, 0.2859271 , 0.09411264, 0.57995386, 0.09829525]])
In [81]: norm(A,axis=1)
Out[81]: array([ 1.14359988, 0.90018201])
In [82]: norm(A,axis=1)[...,None]
Out[82]:
array([[ 1.14359988],
[ 0.90018201]])
In [83]: A/norm(A,axis=1)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-83-707fa10dc673> in <module>()
----> 1 A/norm(A,axis=1)
ValueError: operands could not be broadcast together with shapes (2,5) (2,)
In [84]: A/norm(A,axis=1)[...,None]
Out[84]:
array([[ 0.25514737, 0.78364267, 0.23712878, 0.51238915, 0.04506968],
[ 0.67910309, 0.31763254, 0.10454846, 0.64426289, 0.10919486]])
In [85]: norm(A/norm(A,axis=1)[...,None], axis=1)
Out[85]: array([ 1., 1.])
The session above explains the normalisation procedure. Once we have the normalised matrices A' and B', we take the dot product (we have to transpose the B' matrix, of course), and the result is a matrix whose element i, j is the dot product of the NORMALISED vectors A_i and B_j. We subtract this matrix from 1 and we have a matrix of cosine distances. Or so I hope...
In [1]: import numpy as np
In [2]: from numpy.linalg import norm as n
In [3]: from scipy.spatial.distance import cosine
In [4]: A = np.random.random((100,400))
In [5]: B = np.random.random((100,400))
In [6]: C = np.array([[cosine(a,b) for b in B] for a in A])
In [7]: c = 1.0 - np.dot(A/n(A,axis=1)[:,None],(B/n(B,axis=1)[:,None]).T)
In [8]: np.max(C-c)
Out[8]: 8.8817841970012523e-16
In [9]: np.min(C-c)
Out[9]: -8.8817841970012523e-16
In [10]: %timeit [[cosine(a,b) for b in B] for a in A];
1 loops, best of 3: 1.3 s per loop
In [11]: %timeit 1.0 - np.dot(A/n(A,axis=1)[:,None],(B/n(B,axis=1)[:,None]).T)
100 loops, best of 3: 9.28 ms per loop
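Since the question asks for the top 5 matches, a possible follow-up (a sketch, assuming c is the distance matrix computed above): np.argsort gives, for each vector of A, the indices of the closest B vectors:

# for each row of c (i.e. each vector of A), the indices of the
# 5 vectors of B with the smallest cosine distance
top5 = np.argsort(c, axis=1)[:, :5]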
Upvotes: 6
Reputation: 3405
You can first transform each vector into its unit vector (by dividing it by its length). Then the distance formula simplifies to

d = 1 - e_v * e_w

with e_v = v / ||v||_2 and e_w = w / ||w||_2,

which is faster to calculate.
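A minimal sketch of this for a single pair of vectors:

import numpy as np

v = np.random.random(400)
w = np.random.random(400)
e_v = v / np.linalg.norm(v)   # unit vectors
e_w = w / np.linalg.norm(w)
d = 1 - np.dot(e_v, e_w)      # cosine distance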
Probably even faster is to use scipy.spatial.distance.cdist(XA, XB, 'cosine').
You need to build a matrix from the sets of vectors (pseudo-code):
XA = np.array([vecA1, vecA2, ..., vecAm])
XB = np.array([vecB1, vecB2, ..., vecBn])
distances = scipy.spatial.distance.cdist(XA, XB, 'cosine')
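If you only need the 5 best matches per row, np.argpartition avoids a full sort (a sketch, assuming distances is the matrix from cdist above):

# indices of the 5 smallest distances per row
# (the five are not sorted among themselves; sort them afterwards if needed)
top5 = np.argpartition(distances, 5, axis=1)[:, :5]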
Upvotes: 7