Reputation: 9338
I want to convert a collaborative filtering implementation in Python from cosine similarity to adjusted cosine similarity.
The cosine-similarity-based implementation looks like this:
import pandas as pd
import numpy as np
from scipy.spatial.distance import cosine
from scipy.spatial.distance import pdist, squareform

data = pd.read_csv("C:\\Sample.csv")
data_germany = data.drop("Name", axis=1)

# Item-by-item similarity matrix (one row and one column per fruit)
data_ibs = pd.DataFrame(index=data_germany.columns, columns=data_germany.columns)
for i in range(0, len(data_ibs.columns)):
    for j in range(0, len(data_ibs.columns)):
        # scipy's cosine() returns a distance, so similarity is 1 - distance
        data_ibs.iloc[i, j] = 1 - cosine(data_germany.iloc[:, i], data_germany.iloc[:, j])

# For each fruit, the 5 most similar fruits in decreasing order of similarity
data_neighbours = pd.DataFrame(index=data_ibs.columns, columns=range(1, 6))
for i in range(0, len(data_ibs.columns)):
    data_neighbours.iloc[i, :] = data_ibs.iloc[:, i].sort_values(ascending=False)[:5].index

# Keep ranks 2 to 5 (rank 1 is the fruit itself)
df = data_neighbours.head().loc[:, 2:5]
print(df)
The Sample.csv being used looks like:
where 1 denotes that a user purchased a particular fruit and 0 denotes that the user didn't.
When I run the code above this is what I get:
where rows are fruits and columns are similarity ranks (in decreasing order). In this example, Pear is the most similar to Apple, Melon is the second most similar, and so on.
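For reference, the nested loop above can also be written without explicit iteration. The following is only a minimal NumPy sketch: the `purchases` matrix is hypothetical stand-in data (the actual contents of Sample.csv are not reproduced here), and it assumes every fruit has been bought at least once so that no column norm is zero.

```python
import numpy as np

# Hypothetical purchase matrix: rows are users, columns are fruits (1 = purchased)
fruits = ["Apple", "Orange", "Pear", "Grape", "Melon"]
purchases = np.array([[1, 0, 1, 0, 1],
                      [0, 1, 1, 0, 0],
                      [1, 1, 0, 1, 0],
                      [1, 0, 1, 1, 1]], dtype=float)

# Dot product between every pair of fruit columns
dots = purchases.T @ purchases
# Euclidean norm of each fruit column (square root of the diagonal)
norms = np.sqrt(np.diag(dots))
# Cosine similarity = dot product divided by the product of the norms
cosine_similarity = dots / np.outer(norms, norms)

print(np.round(cosine_similarity, 2))
```

Each entry `cosine_similarity[i, j]` corresponds to `1 - cosine(column_i, column_j)` in the loop version.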
I came across this post on Adjusted Cosine Similarity and I tried to integrate that approach into my code. In this case the data are rating scores given by users to the fruit:
Here's my attempt:
data_ibs = pd.DataFrame(index=data_germany.columns, columns=data_germany.columns)
M_u = data_ibs.mean(axis=1)
M = np.asarray(data_ibs)
item_mean_subtracted = M - M_u[:, None]
for i in range(0, len(data_ibs.columns)):
    for j in range(0, len(data_ibs.columns)):
        data_ibs.iloc[i, j] = 1 - squareform(pdist(item_mean_subtracted.T, "cosine")) ### error
data_neighbours = pd.DataFrame(index=data_ibs.columns, columns=range(1, 6))
for i in range(0, len(data_ibs.columns)):
    data_neighbours.iloc[i, :] = data_ibs.iloc[:, i].sort_values(ascending=False)[:5].index
df = data_neighbours.head().loc[:, 2:5]
But I'm stuck. My question is: how can adjusted cosine similarity be applied correctly to this sample?
Upvotes: 1
Views: 4307
Reputation: 13723
Here's a NumPy-based solution to your problem.
First we store the rating data in an array:
fruits = np.asarray(['Apple', 'Orange', 'Pear', 'Grape', 'Melon'])
M = np.asarray(data.loc[:, fruits])
Then we calculate the adjusted cosine similarity matrix:
M_u = M.mean(axis=1)
item_mean_subtracted = M - M_u[:, None]
similarity_matrix = 1 - squareform(pdist(item_mean_subtracted.T, 'cosine'))
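To make this step concrete, here is a minimal NumPy-only sketch of what the `pdist(..., 'cosine')` call computes once the user means have been subtracted. The helper name `adjusted_cosine_similarity` is hypothetical, and the matrix is the one shown in the demo further down. Like the code above, it averages over all items for each user, so a 0 counts as a rating rather than as a missing value.

```python
import numpy as np

def adjusted_cosine_similarity(ratings):
    """Item-item adjusted cosine similarity (hypothetical helper).

    `ratings` has one row per user and one column per item.
    """
    ratings = np.asarray(ratings, dtype=float)
    # Subtract each user's mean rating from that user's row
    centered = ratings - ratings.mean(axis=1, keepdims=True)
    # Dot products between every pair of centred item columns
    dots = centered.T @ centered
    # Euclidean norm of each centred item column
    norms = np.sqrt(np.diag(dots))
    # Normalise the dot products to obtain cosine similarities
    return dots / np.outer(norms, norms)

M = np.array([[0, 10, 0, 1, 0],
              [6, 0, 0, 0, 2],
              [1, 0, 20, 0, 1],
              [0, 3, 6, 0, 18],
              [3, 0, 2, 0, 0],
              [0, 2, 0, 5, 0]])

similarity_matrix = adjusted_cosine_similarity(M)
print(np.round(similarity_matrix, 2))
```

This sketch assumes no centred item column is all zeros; such a column would have zero norm and lead to a division by zero.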
And finally we sort the results in decreasing order of similarity:
indices = np.fliplr(np.argsort(similarity_matrix, axis=1)[:,:-1])
result = np.hstack((fruits[:, None], fruits[indices]))
DEMO
In [49]: M
Out[49]:
array([[ 0, 10,  0,  1,  0],
       [ 6,  0,  0,  0,  2],
       [ 1,  0, 20,  0,  1],
       [ 0,  3,  6,  0, 18],
       [ 3,  0,  2,  0,  0],
       [ 0,  2,  0,  5,  0]])
In [50]: np.set_printoptions(precision=2)
In [51]: similarity_matrix
Out[51]:
array([[ 1.  ,  0.01, -0.41,  0.48, -0.44],
       [ 0.01,  1.  , -0.57,  0.37, -0.26],
       [-0.41, -0.57,  1.  , -0.56, -0.19],
       [ 0.48,  0.37, -0.56,  1.  , -0.51],
       [-0.44, -0.26, -0.19, -0.51,  1.  ]])
In [52]: result
Out[52]:
array([['Apple', 'Grape', 'Orange', 'Pear', 'Melon'],
       ['Orange', 'Grape', 'Apple', 'Melon', 'Pear'],
       ['Pear', 'Melon', 'Apple', 'Grape', 'Orange'],
       ['Grape', 'Apple', 'Orange', 'Melon', 'Pear'],
       ['Melon', 'Pear', 'Orange', 'Apple', 'Grape']],
      dtype='|S6')
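The sorting step above drops the last column of `argsort`, which relies on each fruit's self-similarity being the strict maximum of its row. If two items happened to have identical centred ratings (similarity 1), the wrong column could be dropped. A possible variant, shown here only as a sketch with a hypothetical helper name `rank_neighbours`, excludes the diagonal explicitly. It is run on the rounded similarity values from the demo output.

```python
import numpy as np

def rank_neighbours(similarity, labels, k=None):
    """Return each label followed by the other labels in decreasing
    order of similarity (hypothetical helper)."""
    sim = np.array(similarity, dtype=float)   # copy so the input is untouched
    labels = np.asarray(labels)
    np.fill_diagonal(sim, -np.inf)            # an item is never its own neighbour
    # Sort descending; the -inf diagonal entry lands last and is dropped
    order = np.argsort(-sim, axis=1)[:, :-1]
    if k is not None:
        order = order[:, :k]                  # keep only the k nearest neighbours
    return np.hstack((labels[:, None], labels[order]))

fruits = np.asarray(['Apple', 'Orange', 'Pear', 'Grape', 'Melon'])
similarity_matrix = np.array([[ 1.  ,  0.01, -0.41,  0.48, -0.44],
                              [ 0.01,  1.  , -0.57,  0.37, -0.26],
                              [-0.41, -0.57,  1.  , -0.56, -0.19],
                              [ 0.48,  0.37, -0.56,  1.  , -0.51],
                              [-0.44, -0.26, -0.19, -0.51,  1.  ]])

result = rank_neighbours(similarity_matrix, fruits)
print(result)
```

On this data the ordering matches the `result` array shown above; passing `k=2`, for example, would keep only the two nearest neighbours per fruit.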
Upvotes: 2