Kolev Kriso

Reputation: 11

Cosine similarity between query and document in a search engine

I am working through the Manning book on information retrieval and am currently at the part about cosine similarity. One thing is not clear to me.
Let's say I have the tf-idf vectors for the query and a document, and I want to compute the cosine similarity between the two vectors. When I compute the magnitude of the document vector, do I sum the squares of all the terms in the vector, or just the terms that appear in the query?

Here is an example: we have the user query "cat food beef". Let's say its vector is (0,1,0,1,1) (assume the vector has only 5 dimensions, one for each unique word in the query and the document). We have a document "Beef is delicious", and its vector is (1,1,1,0,0). We want to find the cosine similarity between the query and the document vectors.

Upvotes: 1

Views: 3462

Answers (1)

alvas

Reputation: 122260

Cosine similarity is simply a fraction where

  • the numerator is the dot product of the 2 vectors
  • the denominator is the product of the magnitudes of the 2 vectors
    • i.e. the Euclidean length, which is the square root of the dot product of a vector with itself

For the numerator, e.g. in numpy:

>>> import numpy as np
>>> y = [1.0, 1.0, 1.0, 0.0, 0.0]
>>> x = [0.0, 1.0, 0.0, 1.0, 1.0]
>>> np.dot(x,y)
1.0

Similarly, we can compute the dot product by multiplying x_i and y_i and summing the products:

>>> x_dot_y = sum(x_i * y_i for x_i, y_i in zip(x, y))
>>> x_dot_y
1.0

For the denominator, the magnitude of each vector is taken over all of its components, not only the ones that match query terms. We can compute it in numpy:

>>> from numpy.linalg import norm
>>> y = [1.0, 1.0, 1.0, 0.0, 0.0]
>>> x = [0.0, 1.0, 0.0, 1.0, 1.0]
>>> norm(x) * norm(y)
2.9999999999999996

Similarly, we can compute the Euclidean length as the square root of each vector's dot product with itself:

>>> import math
>>> # with np.dot
>>> math.sqrt(np.dot(x,x)) * math.sqrt(np.dot(y,y))
2.9999999999999996
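
To make the point about the document magnitude concrete, here is a minimal sketch that compares the standard Euclidean length with a variant restricted to the positions where the query is non-zero. The names `full_norm` and `restricted_norm` are only illustrative, and the restricted variant is shown purely for contrast; it is not what cosine similarity uses.

```python
import math

query = [0.0, 1.0, 0.0, 1.0, 1.0]
document = [1.0, 1.0, 1.0, 0.0, 0.0]

# Standard Euclidean length: sum the squares of ALL document components.
full_norm = math.sqrt(sum(d * d for d in document))

# Hypothetical variant, for contrast only: sum the squares only at
# positions where the query vector is non-zero.
restricted_norm = math.sqrt(
    sum(d * d for q, d in zip(query, document) if q != 0.0)
)
```

The two values differ for this example, so the choice does change the resulting score; the usual definition of cosine similarity uses the full length.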

So the cosine similarity is:

>>> cos_x_y = np.dot(x,y) / (norm(x) * norm(y))
>>> cos_x_y
0.33333333333333337

You can also use the cosine distance function from scipy. Since it returns a distance, subtract it from 1 to get the similarity:

>>> from scipy import spatial
>>> 1 - spatial.distance.cosine(x,y)
0.33333333333333337
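
If you would rather not depend on numpy or scipy, the same computation can be sketched in plain Python. This is only an illustration: the function name `cosine_similarity` is made up here, and returning 0.0 for an all-zero vector is an assumed convention, not something the formula defines.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Numerator: dot product of the two vectors.
    dot = sum(a_i * b_i for a_i, b_i in zip(a, b))
    # Denominator: product of the Euclidean lengths, each taken over
    # every component of its vector.
    norm_a = math.sqrt(sum(a_i * a_i for a_i in a))
    norm_b = math.sqrt(sum(b_i * b_i for b_i in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_a * norm_b)


query = [0.0, 1.0, 0.0, 1.0, 1.0]
document = [1.0, 1.0, 1.0, 0.0, 0.0]
similarity = cosine_similarity(query, document)
```

For the vectors in the question this gives roughly one third, matching the numpy and scipy results up to floating-point rounding.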

Upvotes: 1
