Reputation: 23
I have a dataframe with user_ids as columns and the ids of the movies they've liked as row values. Here's a snippet:
       15       30       50        93      100     113     1008    1028
0  3346.0  42779.0   1816.0  191319.0    138.0   183.0    171.0   283.0
1  1543.0      NaN    169.0    5319.0  34899.0   188.0  42782.0  1183.0
2  5942.0      NaN  30438.0  195514.0    169.0   172.0    187.0  5329.0
3  3249.0      NaN  32361.0     225.0     87.0   547.0   6710.0   283.0
4   794.0      NaN    187.0  195734.0   6297.0  8423.0   1289.0   222.0
I'm trying to calculate the Jaccard similarity between each pair of columns (i.e. between each pair of users, based on the movies they've liked). When I try to use jaccard_similarity_score from sklearn, Python raises the following error:
ValueError: continuous is not supported
Ideally, I would like to get a matrix with user_ids as both rows and columns and the similarity score for each pair of users as the values.
How can I go about computing the Jaccard similarities between these columns? I've tried using a list of dictionaries with user ids as keys and lists of movies as values, but it takes forever to compute.
Upvotes: 1
Views: 5189
Reputation: 5718
Since sklearn.metrics.jaccard_similarity_score expects two input vectors of equal length, you could instead compute the similarity directly on sets, as in the following, partially adapted from this similar question.
import itertools

import numpy as np
import pandas as pd

# Compute the Jaccard similarity index between two sets
def compute_jaccard(user1_vals, user2_vals):
    intersection = user1_vals.intersection(user2_vals)
    union = user1_vals.union(user2_vals)
    # Assumption: two empty sets get similarity 0.0 to avoid dividing by zero
    if not union:
        return 0.0
    return len(intersection) / len(union)

# Small test dataframe: one column per user, NaN pads the shorter lists
users = ['user1', 'user2', 'user3']
df = pd.DataFrame(
    np.transpose(np.array([[1, 2, 3], [3, np.nan, 7], [np.nan, np.nan, 3]])),
    columns=users)

sim_df = pd.DataFrame(columns=users, index=users)

# Iterate through each pair of columns and compute the metric
for u1, u2 in itertools.combinations(df.columns, 2):
    sim_df.loc[u1, u2] = compute_jaccard(set(df[u1].dropna()), set(df[u2].dropna()))

print(sim_df)
This prints the upper-triangular half of the similarity matrix. The lower half would mirror it, and the diagonal would of course be all ones.
      user1 user2     user3
user1   NaN  0.25  0.333333
user2   NaN   NaN       0.5
user3   NaN   NaN       NaN
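If the pairwise loop turns out to be too slow for many users, one possible alternative is to build a binary user-by-movie indicator matrix and get all intersection sizes at once from a matrix product. The following is only a sketch under some assumptions: the function name `jaccard_matrix` is made up for illustration, the indicator matrix is dense (so it may not fit in memory for a very large movie catalogue), and a user with no liked movies is given similarity 0 with everyone, including themselves.

```python
import numpy as np
import pandas as pd


def jaccard_matrix(df):
    """Return a symmetric user-by-user Jaccard similarity DataFrame.

    Hypothetical helper: expects one column per user, movie ids as
    values, and NaN as padding.
    """
    # Long format: one row per (user, movie) pair, with the NaN padding dropped
    long = df.melt(var_name="user", value_name="movie").dropna()
    # Binary indicator matrix: rows are users, columns are movies
    indicator = (pd.crosstab(long["user"], long["movie"]) > 0).astype(int)
    # Restore the original user order; users without movies become all-zero rows
    indicator = indicator.reindex(index=df.columns, fill_value=0)

    m = indicator.to_numpy()
    intersection = m @ m.T                # |A & B| for every pair of users
    sizes = m.sum(axis=1)                 # |A| for every user
    union = sizes[:, None] + sizes[None, :] - intersection  # |A| + |B| - |A & B|

    # Assumption: where the union is empty, the similarity is left at 0
    sim = np.divide(intersection, union,
                    out=np.zeros(union.shape, dtype=float),
                    where=union > 0)
    return pd.DataFrame(sim, index=df.columns, columns=df.columns)


# Same small test data as in the answer above
df = pd.DataFrame({
    "user1": [1, 2, 3],
    "user2": [3, np.nan, 7],
    "user3": [np.nan, np.nan, 3],
})
sim = jaccard_matrix(df)
print(sim)
```

On the small test data this should give the same pairwise values as the loop (0.25, 0.333333 and 0.5), but filled in on both sides of the diagonal.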
Upvotes: 1