Cecilia

Reputation: 309

I have two formulas for calculating 'cosine similarity', what's the difference?

I am doing a project about cosine similarity on movie dataset, I'm confused about the formula for calculating cosine similarity.

[Image of the formula: cos(theta) = (A*B) / (norm(A)*norm(B)) = sum_i(Ai*Bi) / (sqrt(sum_i(Ai^2)) * sqrt(sum_i(Bi^2)))]

But I searched online, and some articles show the denominator as: sqrt(A1^2+B1^2) * sqrt(A2^2+B2^2) * ... * sqrt(Ai^2+Bi^2)

I'm confused: what's the difference? Is one of them correct, or are they both correct?
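To make the difference concrete, here is a small script I wrote with made-up vectors (not from my dataset), reading the second formula literally as one square root per coordinate pair:

import math

A = [4, 3, 5]
B = [5, 5, 1]

dot = sum(a * b for a, b in zip(A, B))  # numerator: sum of Ai*Bi = 40

# denominator from the formula in the image: product of the two vector norms
den_image = math.sqrt(sum(a * a for a in A)) * math.sqrt(sum(b * b for b in B))

# denominator from the articles: one sqrt per coordinate pair, multiplied together
den_articles = math.prod(math.sqrt(a * a + b * b) for a, b in zip(A, B))

print(dot / den_image)     # ~0.7921
print(dot / den_articles)  # ~0.2101 -- a different number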

Upvotes: 0

Views: 681

Answers (1)

Juan Carlos Ramirez

Reputation: 2129

The one in your image is correct. In two dimensions, it is derived from the law of cosines, which relates the length c of one side of a triangle to the lengths a and b of the other two sides and the angle theta opposite c:

c^2 = a^2 + b^2 - 2*a*b*cos(theta)

You can prove this in many ways, and a good verification is that when cos(theta) = 0 (sides a and b are orthogonal), you recover the Pythagorean theorem. To get the formula in the image, translate it into analytic geometry (vectors):

norm(A-B)^2 = norm(A)^2 + norm(B)^2 - 2*norm(A)*norm(B)*cos(theta)

and by using that norm(A-B)^2 is by definition (A-B)*(A-B) and expanding we get

norm(A-B)^2 = norm(A)^2 + norm(B)^2 - 2*A*B

Equating both expressions and cancelling terms yields

norm(A)*norm(B)*cos(theta) = A*B

which is the (rearranged) formula from your image (with norm(v) = sqrt(v*v)). For n dimensions the same argument works, because rotations of Euclidean space preserve norms and inner products, and the 2D plane spanned by the two vectors is just a rotation of the xy plane.

A good sanity check is, again, that orthogonal vectors yield a cosine of 0, and that the cosine is always between -1 and 1 (this is the Cauchy-Schwarz inequality).
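These identities are easy to check numerically. Here is a minimal sketch (with arbitrary vectors, using numpy only for convenience):

import numpy as np

A = np.array([4.0, 3.0, 5.0])
B = np.array([5.0, 5.0, 1.0])

# norm(A-B)^2 equals the expanded form norm(A)^2 + norm(B)^2 - 2*A*B
lhs = np.linalg.norm(A - B) ** 2
rhs = np.linalg.norm(A) ** 2 + np.linalg.norm(B) ** 2 - 2 * A.dot(B)
print(np.isclose(lhs, rhs))  # True

# cos(theta) = A*B / (norm(A)*norm(B)); Cauchy-Schwarz keeps it in [-1, 1]
cos_theta = A.dot(B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(cos_theta)             # ~0.7921
print(-1 <= cos_theta <= 1)  # True

# orthogonal vectors give a cosine of 0
print(np.dot([1.0, 0.0], [0.0, 1.0]))  # 0.0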

Update: For the examples mentioned in your comment, you can reproduce the results from the blog by running:

import sklearn.metrics.pairwise as pw

print(pw.cosine_similarity([[4, 3]], [[5, 5]]))        # ~0.9899
print(pw.cosine_similarity([[4, 3, 5]], [[5, 5, 1]]))  # ~0.7921

Note that if you run:

from sklearn.metrics.pairwise import pairwise_distances

print(pairwise_distances([[4, 3, 5]], [[5, 5, 1]], metric='cosine'))  # ~0.2079

You get 0.208 instead of 0.792. This is because pairwise_distances with the cosine metric computes 1 - cos(theta) (note that 0.208 + 0.792 = 1). The transformation is used because, for a distance, you want the distance from a point to itself to be 0.
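As a quick numerical check of that relationship (a small sketch reusing the vectors above):

import sklearn.metrics.pairwise as pw
from sklearn.metrics.pairwise import pairwise_distances

sim = pw.cosine_similarity([[4, 3, 5]], [[5, 5, 1]])[0, 0]                  # ~0.7921
dist = pairwise_distances([[4, 3, 5]], [[5, 5, 1]], metric='cosine')[0, 0]  # ~0.2079

# distance and similarity are complementary: dist == 1 - sim
print(sim + dist)  # ~1.0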

Upvotes: 1
