Reputation: 923
I am working through the example in the blog to understand collaborative filter method used in recommendation system.I came across cosine similarity expressed as
In python using numpy its written as
def similarity(ratings, kind='user', epslion=1e-9):
if kind == 'user' :
sim = ratings.dot(ratings.T)
elif (kind=='item'):
sim = ratings.T.dot(ratings) + epslion
norms = np.array([np.sqrt(np.diagonal(sim))])
return (sim / norms / norms.T )
Questions
(sim / norms / norms.T )
equals to sim(u,u`)
in the equationThanks in advance for your time – if I’ve missed out anything, over- or under-emphasized a specific point let me know in the comments.
Upvotes: 4
Views: 295
Reputation: 13087
The ratings are stored in the numpy matrix ratings
, where rows correspond to users (index u
), while columns correspond to items (index i
). Since you want to calculate sim(u, u')
, i.e., the similarity between users, let's assume below that kind = 'user'
.
Now, let's have a look first on r_{ui}r_{u'i}
without the square-root scaling factors. This expression is summed over i
which can be interpreted as matrix multiplication of r
with the transpose of r
, i.e.:
\sum_i r_{ui}r_{u'i} = \sum_i r_{ui}(r^T)_{iu'} =: s_{uu'}
As done already above, let's denote the resulting matrix as s
(variable sim
in the code). This matrix is by definition symmetric and its rows/columns are labeled with the "user" indices u/u'
.
Now, the scaling "factor" f_{u} := \sqrt\sum_i r^2_{ui}
is in fact a vector indexed with u
(each item of which is the Euclidean norm of the corresponding row of the matrix r
). However, having constructed s_{uu'}
, we can see that f_{u}
is nothing else than \sqrt s_{uu}
.
Finally, the similarity factor of interest is s_{uu'}/f{u}/f{u'}
. The posted code calculates this for all indices u/u'
and returns the result as a matrix. To this end, it:
sim
(the matrix s
above) as ratings.dot(ratings.T)
f
above) as np.sqrt(np.diagonal(sim))
s
in an efficient way, this is then expressed as a two-dimensional array norms = np.array([np.sqrt(np.diagonal(sim))])
(note the missing []
in your post)s_{uu'}/f{u}/f{u'}
is calculated as sim / norms / norms.T
. Here, since norms
has shape (1, num_of_users)
, the first division does the column scaling, while the division with norms.T
scales the rows. Upvotes: 3