Computing Jaccard Similarity between DataFrame Columns with Different Lengths

Question

I have a dataframe with user_ids as columns and the ids of the movies they've liked as row values. Here's a snippet:

   15       30       50        93       100     113      1008    1028    
0  3346.0  42779.0   1816.0  191319.0    138.0   183.0    171.0   283.0   
1  1543.0      NaN    169.0    5319.0  34899.0   188.0  42782.0  1183.0   
2  5942.0      NaN  30438.0  195514.0    169.0   172.0    187.0  5329.0   
3  3249.0      NaN  32361.0     225.0     87.0   547.0   6710.0   283.0   
4   794.0      NaN    187.0  195734.0   6297.0  8423.0   1289.0   222.0

I'm trying to calculate the Jaccard Similarity between each column (i.e. between each user using the movies they've liked). Python gives the following error when I try to use the jaccard_similarity_score found in sklearn:

ValueError: continuous is not supported

Ideally, I would like to get a matrix whose rows and columns are the user_ids and whose values are the pairwise similarity scores.

How can I go about computing the Jaccard similarities between these columns? I've tried using a list of dictionaries with user ids as keys and lists of movies as values, but it takes forever to compute.

elz · Accepted Answer

Since sklearn.metrics.jaccard_similarity_score expects two input vectors of equal length, you could try something like the following, partially adapted from this similar question.

import itertools
import numpy as np
import pandas as pd

# Method to compute Jaccard similarity index between two sets
def compute_jaccard(user1_vals, user2_vals):
    intersection = user1_vals.intersection(user2_vals)
    union = user1_vals.union(user2_vals)
    jaccard = len(intersection)/float(len(union))
    return jaccard

# Small test dataframe
users = ['user1', 'user2', 'user3']
df = pd.DataFrame(
    np.transpose(np.array([[1, 2, 3], [3, np.nan, 7], [np.nan, np.nan, 3]])),
    columns=users)
sim_df = pd.DataFrame(columns=users, index=users)

# Iterate through columns and compute metric
for u1, u2 in itertools.combinations(df.columns, 2):
    # Drop NaN padding so each user is compared on their actual movie ids only
    sim_df.loc[u1, u2] = compute_jaccard(set(df[u1].dropna()), set(df[u2].dropna()))

print(sim_df)

This fills in only the upper triangle of the similarity matrix, shown below. The lower triangle would mirror it, and the diagonal would of course be all ones.

      user1 user2     user3
user1   NaN  0.25  0.333333
user2   NaN   NaN       0.5
user3   NaN   NaN       NaN
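
If the pairwise loop is too slow for a large number of users, one option is to compute all pairs at once from a binary movie-by-user indicator matrix. The following is a minimal sketch rather than part of the original answer: the function name jaccard_matrix and the intermediate column names 'user' and 'movie' are illustrative choices of mine. It returns the full symmetric matrix, with ones on the diagonal for every user who has at least one movie.

```python
import numpy as np
import pandas as pd


def jaccard_matrix(df):
    """Pairwise Jaccard similarity between the columns of df, ignoring NaN."""
    # Long format: one row per (user, movie) pair, with the NaN padding removed.
    long = df.melt(var_name="user", value_name="movie").dropna()

    # Binary indicator table: rows are movies, columns are users.
    # reindex keeps the original column order and restores users with no movies.
    counts = pd.crosstab(long["movie"], long["user"])
    counts = counts.reindex(columns=df.columns, fill_value=0)
    m = (counts.to_numpy() > 0).astype(np.int64)

    # inter[i, j] = number of movies liked by both user i and user j.
    inter = m.T @ m
    sizes = m.sum(axis=0)
    # |A union B| = |A| + |B| - |A intersect B|
    union = sizes[:, None] + sizes[None, :] - inter

    # Leave NaN where both users have no movies (0 / 0 is undefined).
    sim = np.divide(inter, union,
                    out=np.full(inter.shape, np.nan),
                    where=union > 0)
    return pd.DataFrame(sim, index=df.columns, columns=df.columns)


# Same small test dataframe as above
df = pd.DataFrame({
    "user1": [1, 2, 3],
    "user2": [3, np.nan, 7],
    "user3": [np.nan, np.nan, 3],
})
sim_full = jaccard_matrix(df)
print(sim_full)
```

Note that this builds a dense movies-by-users array, so with a very large movie catalogue it may need a sparse representation instead.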
