BoltzmannBrain

Reputation: 5392

Vectorized cosine similarity calculation in Python

I have two large sets of vectors, A and B. Each element of A is a 1-dimensional vector of length 400, with float values between -10 and 10. For each vector in A, I'm trying to calculate the cosine similarities to all vectors in B in order to find the top 5 vectors in B that best match the given A vector. For now I'm looping over all of A and, within that loop, over all of B, calculating the cosine similarities one pair at a time with SciPy's spatial.distance.cosine(a, b). Is there a faster way to do this? Perhaps with matrices?
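For reference, the loop described above amounts to something like this (the array shapes and names here are made up for illustration):

import numpy as np
from scipy.spatial.distance import cosine

A = np.random.uniform(-10, 10, size=(1000, 400))   # hypothetical set A
B = np.random.uniform(-10, 10, size=(5000, 400))   # hypothetical set B

top5 = []
for a in A:
    # scipy's cosine() returns a distance, i.e. 1 - similarity
    dists = [cosine(a, b) for b in B]
    top5.append(np.argsort(dists)[:5])   # smallest distance = best match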

Upvotes: 9

Views: 9105

Answers (3)

Mohammad Hossein

Reputation: 1

You can use "fast approximate nearest neighbor search" instead, which trades a little accuracy for a large speedup.
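For example, here is a minimal sketch using the Annoy library (my choice for illustration; the answer doesn't name a library). Annoy's 'angular' metric ranks neighbours the same way as cosine distance does:

import numpy as np
from annoy import AnnoyIndex

B = np.random.uniform(-10, 10, size=(5000, 400))   # hypothetical stand-in for set B

index = AnnoyIndex(400, 'angular')   # 'angular' orders neighbours like cosine distance
for i, b in enumerate(B):
    index.add_item(i, b)
index.build(10)                      # 10 trees; more trees = better accuracy, slower build

a = np.random.uniform(-10, 10, 400)    # one query vector from set A
top5 = index.get_nns_by_vector(a, 5)   # approximate 5 best matches in B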

Upvotes: 0

gboffi

Reputation: 25023

This is a NAIVE, loop-free, no-overhead(?) implementation of what you need...

import numpy as np
from numpy.linalg import norm

# normalise the rows of A and B, then take all pairwise dot products
res = 1 - np.dot(A/norm(A, axis=1)[..., None], (B/norm(B, axis=1)[..., None]).T)

Could you please benchmark it on a subset of your data and let us know if it's faster than scipy's cosine distance?


PS: axis=1 above is based on the assumption that your vectors are stored row-wise,

print A
# [[1 2 3 4 5 6 7 8 ... 400]
#  [2 3 4 5 6 7 8 9 ... 401]

etc


Comments

In [79]: A = np.random.random((2,5))

In [80]: A
Out[80]: 
array([[ 0.2917865 ,  0.89617367,  0.27118045,  0.58596817,  0.05154168],
       [ 0.61131638,  0.2859271 ,  0.09411264,  0.57995386,  0.09829525]])

In [81]: norm(A,axis=1)
Out[81]: array([ 1.14359988,  0.90018201])

In [82]: norm(A,axis=1)[...,None]
Out[82]: 
array([[ 1.14359988],
       [ 0.90018201]])

In [83]: A/norm(A,axis=1)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-83-707fa10dc673> in <module>()
----> 1 A/norm(A,axis=1)

ValueError: operands could not be broadcast together with shapes (2,5) (2,) 

In [84]: A/norm(A,axis=1)[...,None]
Out[84]: 
array([[ 0.25514737,  0.78364267,  0.23712878,  0.51238915,  0.04506968],
       [ 0.67910309,  0.31763254,  0.10454846,  0.64426289,  0.10919486]])

In [85]: norm(A/norm(A,axis=1)[...,None], axis=1)
Out[85]: array([ 1.,  1.])


The session above explains the normalisation procedure. Once we have the normalised matrices A' and B', we take the dot product (transposing B', of course); the result is a matrix whose element (i, j) is the dot product of the NORMALISED vectors A_i and B_j. Subtracting this matrix from 1 gives the matrix of cosine distances. Or so I hope...

Test & Benchmark

In [1]: import numpy as np                                              

In [2]: from numpy.linalg import norm as n

In [3]: from scipy.spatial.distance import cosine

In [4]: A = np.random.random((100,400))

In [5]: B = np.random.random((100,400))

In [6]: C = np.array([[cosine(a,b) for b in B] for a in A])

In [7]: c = 1.0 - np.dot(A/n(A,axis=1)[:,None],(B/n(B,axis=1)[:,None]).T)

In [8]: np.max(C-c)
Out[8]: 8.8817841970012523e-16

In [9]: np.min(C-c)
Out[9]: -8.8817841970012523e-16

In [10]: %timeit [[cosine(a,b) for b in B] for a in A];
1 loops, best of 3: 1.3 s per loop

In [11]: %timeit 1.0 - np.dot(A/n(A,axis=1)[:,None],(B/n(B,axis=1)[:,None]).T)
100 loops, best of 3: 9.28 ms per loop


Upvotes: 6

jofel

Reputation: 3405

You can first transform each vector into its unit vector (by dividing it by its length). Then the distance formula simplifies to

 d = 1 - e_v * e_w

 with e_v = v / ||v||_2 , e_w = w / ||w||_2

which is faster to calculate.
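A tiny worked example of that simplification (hypothetical numbers, just to illustrate):

import numpy as np

v = np.array([3.0, 4.0])
w = np.array([4.0, 3.0])

e_v = v / np.linalg.norm(v)   # unit vector of v
e_w = w / np.linalg.norm(w)   # unit vector of w
d = 1 - np.dot(e_v, e_w)      # cosine distance: 1 - 24/25 = 0.04

Normalising each set once up front reduces the per-pair work to a single dot product.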

Probably even faster is to use scipy.spatial.distance.cdist(XA, XB, 'cosine'). You need to build a matrix from each set of vectors (pseudo-code):

import numpy as np
import scipy.spatial.distance

XA = np.array([vecA1, vecA2, ...])   # one row per vector in set A
XB = np.array([vecB1, vecB2, ...])   # one row per vector in set B
distances = scipy.spatial.distance.cdist(XA, XB, 'cosine')
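To get the top 5 matches the question asks for, one option (my addition, not part of the answer) is to sort each row of the cdist result:

import numpy as np
from scipy.spatial.distance import cdist

XA = np.random.random((100, 400))     # hypothetical stand-ins for the real sets
XB = np.random.random((100, 400))

distances = cdist(XA, XB, 'cosine')   # shape (len(XA), len(XB))

# for each row of XA: indices of the 5 closest rows of XB, best match first
top5 = np.argsort(distances, axis=1)[:, :5]

For a very large B, np.argpartition(distances, 5, axis=1)[:, :5] finds the 5 smallest entries per row in linear time, at the cost of returning them unordered.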

Upvotes: 7
