Reputation: 2949
I have a TF-IDF matrix of shape (149, 1001). What I want is to compute the cosine similarity of the last column with all the columns.
Here is what I did:
from numpy import dot
from numpy.linalg import norm
for i in range(mat.shape[1]-1):
    cos_sim = dot(mat[:,i], mat[:,-1])/(norm(mat[:,i])*norm(mat[:,-1]))
    cos_sim
But this loop makes it slow. So, is there a more efficient way? I want to do it with NumPy only.
Upvotes: 6
Views: 6816
Reputation: 88226
There's an sklearn function to compute the cosine similarity between vectors, cosine_similarity. Here's a use case with an example array:
import numpy as np

a = np.random.randint(0,10,(5,5))

a
array([[5, 2, 0, 4, 1],
       [4, 2, 8, 2, 4],
       [9, 7, 4, 9, 7],
       [4, 6, 0, 1, 3],
       [1, 1, 2, 5, 0]])
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(a[None,:,-1] , a.T[:-1])
# array([[0.94022805, 0.91705665, 0.75592895, 0.79921221]])
Where a[None,:,-1] is the last column of a, reshaped so that both arrays have the same second dimension (shape[1]), which is a requirement of the function:
a[None,:,-1]
# array([[1, 4, 7, 3, 0]])
And by transposing a, its columns become rows, so the result is the cosine similarity with all the other columns.
Check with the solution from the question:
from numpy import dot
from numpy.linalg import norm
cos_sim = []
for i in range(a.shape[1]-1):
    cos_sim.append(dot(a[:,i], a[:,-1])/(norm(a[:,i])*norm(a[:,-1])))
np.allclose(cos_sim, cosine_similarity(a[None,:,-1] , a.T[:-1]))
# True
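Applied to the question's (149, 1001) matrix, the same call looks like the sketch below; mat here is just a random stand-in for the TF-IDF matrix, and a dense NumPy array is assumed:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

mat = np.random.rand(149, 1001)   # stand-in for the (149, 1001) TF-IDF matrix

X = mat[None, :, -1]              # last column as a (1, 149) row vector
Y = mat.T[:-1]                    # every other column as a row, shape (1000, 149)

sims = cosine_similarity(X, Y)    # shape (1, 1000)
print(sims.shape)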
Upvotes: 2
Reputation: 221504
Leverage 2D vectorized matrix-multiplication
Here's one with NumPy using matrix-multiplication on 2D data -
from numpy.linalg import norm

p1 = mat[:,-1].dot(mat[:,:-1])                # dot products of the last column with every other column
p2 = norm(mat[:,:-1],axis=0)*norm(mat[:,-1])  # products of the corresponding column norms
out1 = p1/p2
Explanation: p1 is the vectorized equivalent of the loop over dot(mat[:,i], mat[:,-1]), and p2 of (norm(mat[:,i])*norm(mat[:,-1])).
Sample run for verification -
In [57]: np.random.seed(0)
...: mat = np.random.rand(149,1001)
In [58]: out = np.empty(mat.shape[1]-1)
...: for i in range(mat.shape[1]-1):
...:     out[i] = dot(mat[:,i], mat[:,-1])/(norm(mat[:,i])*norm(mat[:,-1]))
In [59]: p1 = mat[:,-1].dot(mat[:,:-1])
...: p2 = norm(mat[:,:-1],axis=0)*norm(mat[:,-1])
...: out1 = p1/p2
In [60]: np.allclose(out, out1)
Out[60]: True
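The three steps can also be wrapped into a small helper; a minimal sketch, where the function name cosine_vs_last is mine, not from the answer:
import numpy as np
from numpy.linalg import norm

def cosine_vs_last(mat):
    # Cosine similarity of every column except the last against the last column
    last = mat[:, -1]                       # reference column, shape (n_rows,)
    rest = mat[:, :-1]                      # remaining columns, shape (n_rows, n_cols-1)
    p1 = last.dot(rest)                     # all dot products in one matrix-vector product
    p2 = norm(rest, axis=0) * norm(last)    # products of the column norms
    return p1 / p2

mat = np.random.rand(149, 1001)
print(cosine_vs_last(mat).shape)            # (1000,)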
Timings -
In [61]: %%timeit
...: out = np.empty(mat.shape[1]-1)
...: for i in range(mat.shape[1]-1):
...:     out[i] = dot(mat[:,i], mat[:,-1])/(norm(mat[:,i])*norm(mat[:,-1]))
18.5 ms ± 977 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [62]: %%timeit
...: p1 = mat[:,-1].dot(mat[:,:-1])
...: p2 = norm(mat[:,:-1],axis=0)*norm(mat[:,-1])
...: out1 = p1/p2
939 µs ± 29.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# @yatu's soln
In [89]: a = mat
In [90]: %timeit cosine_similarity(a[None,:,-1] , a.T[:-1])
2.47 ms ± 461 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Further optimize on norm with einsum
Alternatively, we could compute p2 with np.einsum.
So, norm(mat[:,:-1],axis=0) could be replaced by:
np.sqrt(np.einsum('ij,ij->j',mat[:,:-1],mat[:,:-1]))
Hence, giving us a modified p2:
p2 = np.sqrt(np.einsum('ij,ij->j',mat[:,:-1],mat[:,:-1]))*norm(mat[:,-1])
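As a quick sanity check (a sketch), the einsum call computes the per-column sum of squares, so its square root matches norm(..., axis=0):
import numpy as np
from numpy.linalg import norm

mat = np.random.rand(149, 1001)
rest = mat[:, :-1]

# 'ij,ij->j' multiplies rest with itself elementwise and sums over the rows (axis i),
# giving the squared L2 norm of every column
sq = np.einsum('ij,ij->j', rest, rest)

print(np.allclose(np.sqrt(sq), norm(rest, axis=0)))   # True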
Timings on same setup as earlier -
In [82]: %%timeit
...: p1 = mat[:,-1].dot(mat[:,:-1])
...: p2 = np.sqrt(np.einsum('ij,ij->j',mat[:,:-1],mat[:,:-1]))*norm(mat[:,-1])
...: out1 = p1/p2
607 µs ± 132 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
30x+ speedup over the loopy one!
Upvotes: 6