I am trying to compute the cosine similarity between the TF-IDF vector representations of documents (there are 500 documents in a MySQL database) and the TF-IDF vector representation of a user query. Initially I wrote my own code for this computation (it is commented out in the snippet below); it took more than 4.8 seconds on average. While searching for ways to reduce the computation time I switched to the numpy library, but it still takes about 1.3 seconds.
import numpy as np

def csim(dtv, qv):
    csim = []
    b = np.array(qv)  # query vector
    for i in range(len(dtv)):
        # My original implementation:
        # numerator = np.dot(dtv[i], qv)
        # denominator = np.sqrt(np.sum(np.power(dtv[i], 2)) * np.sum(np.power(qv, 2)))
        # csim.append(((numerator / (denominator + 1e-9)), i + 1))
        a = np.array(dtv[i])  # i-th document vector
        numerator = np.dot(a, b)
        denominator = np.linalg.norm(a) * np.linalg.norm(b)
        csim.append(((numerator / (denominator + 1e-9)), i + 1))
    return csim
Here dtv is a list of document vectors (each itself a list of term weights), and qv is the query vector, also a list. How can I reduce the computation time of the cosine similarity?
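For concreteness, a minimal sketch of the input format (the numbers below are made up purely for illustration):

# Illustrative inputs: a tiny vocabulary of 4 terms
dtv = [
    [0.00, 0.42, 0.00, 0.17],  # TF-IDF weights for document 1
    [0.31, 0.00, 0.08, 0.00],  # TF-IDF weights for document 2
    # ... one inner list per document, all of equal length
]
qv = [0.00, 0.27, 0.00, 0.00]  # TF-IDF weights for the query

scores = csim(dtv, qv)  # list of (similarity, document_id) tuples
best = max(scores)      # document with the highest similarity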
Here is one reference I found, but I could not understand how the cosine similarity was being computed there.
I am looking to reduce the computation time to the order of milliseconds.
Here is the TF-IDF vector representation of the documents, and here is the TF-IDF vector representation of a query.
One way to compute the cosine similarities between two batches of vectors is to first build a NumPy matrix for each batch, each of shape (n_vectors, vector_size), like this:
X = np.array(dtv)  # dtv is a list of vectors, shape = (len(dtv), len(dtv[0]))
Y = np.array([b])  # b is a single vector (one list), shape = (1, len(b))
Then compute the cosine-similarity formula cos = X . Y^T / (||X|| * ||Y||) in batch form like this:
numerator = np.matmul(X, Y.transpose())  # shape (n_docs, 1)
norms_X = np.sqrt((X * X).sum(axis=1))   # or np.linalg.norm(X, axis=1)
norms_Y = np.sqrt((Y * Y).sum(axis=1))   # or np.linalg.norm(Y, axis=1)
denominator = np.matmul(norms_X.reshape(-1, 1), norms_Y.reshape(1, -1))
cs = np.divide(numerator, denominator)   # shape (n_docs, 1)
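If you want this batch computation as a drop-in replacement for the csim function in the question, here is a minimal sketch (the name batch_csim is my own; the 1e-9 guard against zero norms is carried over from the question):

import numpy as np

def batch_csim(dtv, qv):
    X = np.array(dtv, dtype=np.float64)      # shape (n_docs, vector_size)
    Y = np.array([qv], dtype=np.float64)     # shape (1, vector_size)
    numerator = np.matmul(X, Y.transpose())  # shape (n_docs, 1)
    norms_X = np.linalg.norm(X, axis=1)
    norms_Y = np.linalg.norm(Y, axis=1)
    denominator = np.matmul(norms_X.reshape(-1, 1), norms_Y.reshape(1, -1))
    cs = numerator / (denominator + 1e-9)
    # return (similarity, document_id) pairs like the original csim
    return [(cs[i, 0], i + 1) for i in range(len(dtv))]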
Looking at the implementation in sklearn, they do it differently: they first normalize the vectors to unit length and then take the dot product of the matrices of shape (n_vectors, vector_size). According to my tests, this is about twice as fast as the implementation above. Here is a summary of the sklearn approach:
norms_X = np.sqrt((X * X).sum(axis=1))  # or np.linalg.norm(X, axis=1)
X /= norms_X[:, np.newaxis]             # normalize rows of X to unit length
norms_Y = np.sqrt((Y * Y).sum(axis=1))  # or np.linalg.norm(Y, axis=1)
Y /= norms_Y[:, np.newaxis]             # normalize rows of Y to unit length
cs = np.dot(X, Y.T)                     # cosine similarities, shape (n_docs, 1)
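If scikit-learn is already a dependency, you can also call its implementation directly instead of reimplementing it; a minimal sketch:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

X = np.array(dtv)             # shape (n_docs, vector_size)
Y = np.array([qv])            # shape (1, vector_size)
cs = cosine_similarity(X, Y)  # shape (n_docs, 1)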