Nithin Varghese
Nithin Varghese

Reputation: 923

Not able to understand python function of cosine similarity

I am working through the example in the blog to understand collaborative filter method used in recommendation system.I came across cosine similarity expressed as

enter image description here

In python using numpy its written as

def similarity(ratings, kind='user', epslion=1e-9):
      if kind == 'user' :
         sim = ratings.dot(ratings.T)
      elif (kind=='item'):
          sim = ratings.T.dot(ratings) + epslion
      norms = np.array([np.sqrt(np.diagonal(sim))])

      return  (sim / norms / norms.T )

Questions

  1. what is the math behind calculation of the norm in the code
  2. how does (sim / norms / norms.T ) equals to sim(u,u`) in the equation

Thanks in advance for your time – if I’ve missed out anything, over- or under-emphasized a specific point let me know in the comments.

Upvotes: 4

Views: 295

Answers (1)

ewcz
ewcz

Reputation: 13087

The ratings are stored in the numpy matrix ratings, where rows correspond to users (index u), while columns correspond to items (index i). Since you want to calculate sim(u, u'), i.e., the similarity between users, let's assume below that kind = 'user'.

Now, let's have a look first on r_{ui}r_{u'i} without the square-root scaling factors. This expression is summed over i which can be interpreted as matrix multiplication of r with the transpose of r, i.e.:

\sum_i r_{ui}r_{u'i} = \sum_i r_{ui}(r^T)_{iu'} =: s_{uu'}

As done already above, let's denote the resulting matrix as s (variable sim in the code). This matrix is by definition symmetric and its rows/columns are labeled with the "user" indices u/u'.

Now, the scaling "factor" f_{u} := \sqrt\sum_i r^2_{ui} is in fact a vector indexed with u (each item of which is the Euclidean norm of the corresponding row of the matrix r). However, having constructed s_{uu'}, we can see that f_{u} is nothing else than \sqrt s_{uu}.

Finally, the similarity factor of interest is s_{uu'}/f{u}/f{u'}. The posted code calculates this for all indices u/u' and returns the result as a matrix. To this end, it:

  1. calculates sim (the matrix s above) as ratings.dot(ratings.T)
  2. gets the square root of its diagonal (the vector f above) as np.sqrt(np.diagonal(sim))
  3. to do the row/column scaling of s in an efficient way, this is then expressed as a two-dimensional array norms = np.array([np.sqrt(np.diagonal(sim))]) (note the missing [] in your post)
  4. finally, the matrix s_{uu'}/f{u}/f{u'} is calculated as sim / norms / norms.T. Here, since norms has shape (1, num_of_users), the first division does the column scaling, while the division with norms.T scales the rows.

Upvotes: 3

Related Questions