Reputation: 21
Referring to this link,
which calculates the adjusted cosine similarity matrix (given a ratings matrix M with m users and n items) as follows:
M_u = M.mean(axis=1)
item_mean_subtracted = M - M_u[:, None]
similarity_matrix = 1 - squareform(pdist(item_mean_subtracted.T, 'cosine'))
I cannot see how the 'both rated' condition from that definition is satisfied in this computation.
I have manually calculated the adjusted cosine similarities and they seem to differ from the values I get from the above code.
Could anyone please clarify this?
Upvotes: 2
Views: 4289
Reputation: 3586
Let's first try to understand the formulation: the matrix is stored such that each row is a user and each column is an item. Users are indexed by u and items by i.
Each user has a different judgement of how good or bad something is: a 1 from one user could be a 3 from another. That is why we subtract each user's mean rating \bar{R}_u from each rating R_{u,i}. This is computed as item_mean_subtracted in your code; notice that we subtract the row mean from each element to remove the user's bias. After that, we normalize each column (item) by dividing it by its norm and then compute the cosine similarity between each pair of columns.
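For reference, this is the usual adjusted cosine similarity between items i and j, with the sum running over users u (I'm assuming the linked page uses this standard formulation; its exact notation may differ):

sim(i, j) = \frac{\sum_u (R_{u,i} - \bar{R}_u)(R_{u,j} - \bar{R}_u)}{\sqrt{\sum_u (R_{u,i} - \bar{R}_u)^2} \, \sqrt{\sum_u (R_{u,j} - \bar{R}_u)^2}}

where \bar{R}_u is user u's mean rating.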
pdist(item_mean_subtracted.T, 'cosine') computes the cosine distance between the items, and it is known that
cosine similarity = 1 - cosine distance,
which is why the code works.
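As a quick sanity check of that identity (a minimal sketch using two arbitrary vectors, not data from the question):
import numpy as np
from scipy.spatial.distance import pdist
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 1.0])
cos_sim = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))  # cosine similarity by definition
cos_dist = pdist(np.vstack([a, b]), 'cosine')[0]              # SciPy's cosine distance between the two rows
print(np.isclose(1 - cos_dist, cos_sim))                      # True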
Now, what if we compute it directly according to the definition? I have commented what is being performed in each step; copy and paste the code and compare it with your calculation by printing out more intermediate steps.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from numpy.linalg import norm
M = np.asarray([[2, 3, 4, 1, 0],
                [0, 0, 0, 0, 5],
                [5, 4, 3, 0, 0],
                [1, 1, 1, 1, 1]])
M_u = M.mean(axis=1)
item_mean_subtracted = M - M_u[:, None]
similarity_matrix = 1 - squareform(pdist(item_mean_subtracted.T, 'cosine'))
print(similarity_matrix)
# Computing the cosine similarity directly from the definition
n = M.shape[1]  # number of columns (items)
# Divide each column by its norm, i.e. normalize each item vector
normalized = item_mean_subtracted / norm(item_mean_subtracted, axis=0).reshape(1, n)
normalized = normalized.T  # transpose so that each row is an item
# Similarity matrix: inner product of every pair of normalized item vectors
similarity_matrix2 = np.asarray([[np.inner(normalized[i], normalized[j])
                                  for i in range(n)] for j in range(n)])
print(similarity_matrix2)
Both snippets give the same result:
[[ 1. 0.86743396 0.39694169 -0.67525773 -0.72426278]
[ 0.86743396 1. 0.80099604 -0.64553225 -0.90790362]
[ 0.39694169 0.80099604 1. -0.37833504 -0.80337196]
[-0.67525773 -0.64553225 -0.37833504 1. 0.26594024]
[-0.72426278 -0.90790362 -0.80337196 0.26594024 1. ]]
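If you prefer to verify the agreement programmatically rather than by eye, a one-line check (assuming both matrices from the snippet above are still in scope) is:
print(np.allclose(similarity_matrix, similarity_matrix2))  # True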
Upvotes: 2