Reputation: 21
x_tsvd is a matrix with 4.6 million rows and svd_tfidf is a matrix with 1862 rows. Both matrices have the same number of columns (260).
I want to calculate the cosine similarity between each of the 4.6 million rows of x_tsvd and each of the 1862 rows of svd_tfidf.
Is there any way I can optimize this so that it takes less time?
import numpy as np
from numpy.linalg import norm

best_match = []
keys = np.array(df_5M['file'])
values = np.array(df['file'])
for i in range(len(x_tsvd)):
    array_ = []
    for j in range(len(svd_tfidf)):
        cosine_similarity_ = np.dot(x_tsvd[i], svd_tfidf[j]) / (norm(x_tsvd[i]) * norm(svd_tfidf[j]))
        array_.append(cosine_similarity_)
    index = np.array(array_).argsort()
    best_match.append({keys[i]: values[index][::-1][0:5]})
Update:
import numpy as np
from numpy.linalg import norm

best_match = []
keys = np.array(df_5M['file'])
values = np.array(df['file'])
for i in range(len(x_tsvd)):
    a = x_tsvd[i]
    b = svd_tfidf
    a_dot_b = np.sum(np.multiply(a, b), axis=1)
    norm_a = norm(a)
    norm_b = norm(b, axis=1)
    cosine_similarity_ = a_dot_b / (norm_a * norm_b)
    index = np.argsort(cosine_similarity_)
    best_match.append({keys[i]: values[index][::-1][0:6]})
Upvotes: 0
Views: 159
Reputation: 50806
There are several issues in your code. First of all, norm(x_tsvd[i]) is recomputed len(svd_tfidf)=1862 times, while the expression can be moved to the parent loop. Furthermore, norm(svd_tfidf[j]) is recomputed len(x_tsvd)=4.6e6 times, while it can be precomputed for all j values only once. Moreover, calling np.dot(x_tsvd[i], svd_tfidf[j]) in two nested loops is not efficient. You can use one big matrix multiplication instead: x_tsvd @ svd_tfidf.T. However, since the resulting matrix is huge (~64 GiB), it is reasonable to split x_tsvd into chunks of 512~4096 rows. Additionally, you can precompute the inverse of the norms, because multiplying by a precomputed inverse is generally significantly faster than dividing.
np.argsort(tmp_matrix[i])[::-1][0:5] is not efficient either; argpartition can be used instead to only extract the 5 best items (as I pointed out in a comment on the previous answer, which advised you to use argsort). Note that a partition does not behave the same way as a sort if you care about equal items (i.e. a stable sort). There is no stable partitioning implementation available in NumPy yet.
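For instance, here is a small sketch (with made-up scores) showing the behavioural difference between the two approaches:

import numpy as np

scores = np.array([0.1, 0.9, 0.3, 0.7, 0.8, 0.2, 0.5])

# argsort orders every item (O(n log n)), then the top 5 are kept:
top5_sorted = np.argsort(scores)[::-1][0:5]                     # -> [1, 4, 3, 6, 2]

# argpartition only guarantees that the last 5 positions hold the
# indices of the 5 best items, in no particular order (O(n)):
top5_unordered = np.argpartition(scores, len(scores) - 5)[-5:]

# Same set of items, but the partitioned result is not sorted by score.
assert set(top5_sorted) == set(top5_unordered)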
In the end, the optimized implementation should look like this:
# keys, values and best_match are defined as in the question above
chunk_size = 1024  # any value in the 512~4096 range discussed above
inv_norm_j = 1.0 / norm_by_line(svd_tfidf)  # Horizontal vector
for chunk_start in range(0, len(x_tsvd), chunk_size):
    chunk_end = min(chunk_start + chunk_size, len(x_tsvd))
    x_tsvd_block = x_tsvd[chunk_start:chunk_end]
    inv_norm_i = 1.0 / norm_by_line(x_tsvd_block)[:, None]  # Vertical vector
    tmp_matrix = (x_tsvd_block @ svd_tfidf.T) * inv_norm_i * inv_norm_j
    # Top-5 indices per row (note: ordered by index here, not by similarity)
    best_match_values = values[np.sort(np.argpartition(tmp_matrix, len(svd_tfidf)-5)[:, -5:])[:, ::-1]]
    # Pure-Python part that can hardly be optimized
    for i in range(chunk_start, chunk_end):
        best_match.append({keys[i]: best_match_values[i - chunk_start]})
Here, norm_by_line computes the norm of each row and can be implemented in a vectorized way (with SciPy, for example). Note that this is an untested draft and not code that you should trust completely and copy-paste blindly ;).
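For reference, a minimal sketch of such a helper, here using NumPy's np.linalg.norm rather than SciPy (the name norm_by_line simply matches the draft above):

import numpy as np

def norm_by_line(mat):
    # L2 norm of every row in a single vectorized call, equivalent to
    # np.array([np.linalg.norm(row) for row in mat]) but much faster.
    return np.linalg.norm(mat, axis=1)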
Regarding the recent update (which is code computing a different result), most of the optimizations are identical, but there is a big improvement you can make to np.sum(np.multiply(a, b), axis=1). Indeed, you can use np.einsum('ij,ij->i', a, b) instead, so as not to compute the large, expensive temporary matrix. It is 3 times faster on my machine.
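As an illustration, here is a minimal sketch of that equivalence, assuming a and b are two 2-D arrays of the same (made-up) shape:

import numpy as np

rng = np.random.default_rng(0)
a = rng.random((1862, 260))
b = rng.random((1862, 260))

slow = np.sum(np.multiply(a, b), axis=1)  # materializes a full (1862, 260) temporary
fast = np.einsum('ij,ij->i', a, b)        # same row-wise dot products, no large temporary

assert np.allclose(slow, fast)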
Upvotes: 1