Reputation: 11
I am working through Manning's book on information retrieval. I am currently at the part about cosine similarity, and one thing is not clear to me.
Let's say I have the tf-idf vectors for the query and a document, and I want to compute the cosine similarity between the two vectors. When I compute the magnitude of the document vector, do I sum the squares of all the terms in the vector, or just the terms that appear in the query?
Here is an example: we have the user query "cat food beef". Let's say its vector is (0,1,0,1,1) (assume the vector has only 5 dimensions, one for each unique word in the query and the document). We have a document "Beef is delicious", and its vector is (1,1,1,0,0). We want to find the cosine similarity between the query and the document vectors.
Upvotes: 1
Views: 3462
Reputation: 122260
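Cosine similarity is simply a fraction where the numerator is the dot product of the two vectors and the denominator is the product of their magnitudes (Euclidean lengths).
For the numerator, e.g. in numpy: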
>>> import numpy as np
>>> y = [1.0, 1.0, 1.0, 0.0, 0.0]
>>> x = [0.0, 1.0, 0.0, 1.0, 1.0]
>>> np.dot(x,y)
1.0
Similarly, we can compute the dot product by hand, multiplying each x_i by the corresponding y_i and summing the products:
>>> x_dot_y = sum(x_i * y_i for x_i, y_i in zip(x, y))
>>> x_dot_y
1.0
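For the denominator, we can compute the magnitude in numpy. Note that the magnitude of each vector is taken over all of its components, not only the ones that correspond to query terms: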
>>> from numpy.linalg import norm
>>> y = [1.0, 1.0, 1.0, 0.0, 0.0]
>>> x = [0.0, 1.0, 0.0, 1.0, 1.0]
>>> norm(x) * norm(y)
2.9999999999999996
Similarly, we can compute the Euclidean length as the square root of each vector's dot product with itself:
>>> import math
>>> math.sqrt(np.dot(x,x)) * math.sqrt(np.dot(y,y))
2.9999999999999996
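The magnitude can also be computed in plain Python with no numpy at all. The following is only a minimal sketch: the helper name `magnitude` and the variable names `query` and `document` are made up for illustration, and the vectors are the ones from the question. It makes explicit that every component of a vector contributes to its length, including the document terms that do not occur in the query.

```python
import math

def magnitude(vector):
    """Euclidean length: square root of the sum of squares of ALL components."""
    return math.sqrt(sum(component * component for component in vector))

# Vectors from the question (hypothetical variable names).
query = [0.0, 1.0, 0.0, 1.0, 1.0]
document = [1.0, 1.0, 1.0, 0.0, 0.0]

# Denominator of the cosine similarity: product of the two magnitudes.
denominator = magnitude(query) * magnitude(document)
print(denominator)
```

Each vector here has three non-zero components equal to 1, so each magnitude is the square root of 3 and the product is 3 up to floating-point rounding.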
So the cosine similarity is:
>>> cos_x_y = np.dot(x,y) / (norm(x) * norm(y))
>>> cos_x_y
0.33333333333333337
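You can also use the cosine distance function from scipy directly. It returns the cosine distance, which is 1 minus the cosine similarity, so subtract the result from 1: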
>>> from scipy import spatial
>>> 1 - spatial.distance.cosine(x,y)
0.33333333333333337
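For completeness, here is a sketch of the whole computation as a single dependency-free function. The name `cosine_similarity` is hypothetical, and returning 0.0 for a zero-length vector is an assumption of this sketch, not something defined by the formula (the fraction is undefined in that case).

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors, in plain Python."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    # Numerator: dot product of the two vectors.
    dot = sum(a_i * b_i for a_i, b_i in zip(a, b))
    # Denominator: product of the magnitudes, each over ALL components.
    norm_a = math.sqrt(sum(a_i * a_i for a_i in a))
    norm_b = math.sqrt(sum(b_i * b_i for b_i in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption for this sketch: treat a zero vector as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)

query = [0.0, 1.0, 0.0, 1.0, 1.0]      # "cat food beef"
document = [1.0, 1.0, 1.0, 0.0, 0.0]   # "Beef is delicious"
print(cosine_similarity(query, document))
```

For the vectors in the question this gives roughly one third, in line with the numpy and scipy results above.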
Upvotes: 1