P Mort

Reputation: 11

Computing similarity measure for binary pandas dataframe

I need to code a similarity score in Python to find matches based on movie genre.

The comparison is for one user: I want to measure the similarity between that user's binary genre scores and a dataframe of binary genre scores for 40,000 movie titles. I need to go through the dataframe and compare each row with the user's scores.

For instance, take user 1: Score [0,1,0,0,0,0,1,0,0,0,1,1,0,0,0,1]

Compare this to the movies dataframe: [image: Movies Dataframe]

I would like to compute a similarity score between the user and each title, so that I can rank the titles that are most similar to the user's preferences.

I have found that Hamming distance is probably the best method for binary values. How can I implement this? Thanks

Upvotes: 0

Views: 350

Answers (1)

Sergey Bushmanov

Reputation: 25199

Try:

import numpy as np
from scipy.spatial.distance import cdist
# random binary sample data: 10 rows, 10 features
x = np.random.randint(0, 2, 100).reshape(10, 10)
# pairwise Hamming distance (fraction of positions that differ)
cdist(x, x, metric="hamming")
array([[0. , 0.6, 0.8, 0.6, 0.3, 0.4, 0.7, 0.4, 0.5, 0.6],
       [0.6, 0. , 0.4, 0.6, 0.7, 0.4, 0.3, 0.6, 0.5, 0.6],
       [0.8, 0.4, 0. , 0.4, 0.7, 0.4, 0.3, 0.8, 0.5, 0.4],
       [0.6, 0.6, 0.4, 0. , 0.3, 0.6, 0.5, 0.6, 0.3, 0.4],
       [0.3, 0.7, 0.7, 0.3, 0. , 0.5, 0.6, 0.5, 0.4, 0.5],
       [0.4, 0.4, 0.4, 0.6, 0.5, 0. , 0.5, 0.6, 0.7, 0.4],
       [0.7, 0.3, 0.3, 0.5, 0.6, 0.5, 0. , 0.5, 0.4, 0.3],
       [0.4, 0.6, 0.8, 0.6, 0.5, 0.6, 0.5, 0. , 0.5, 0.6],
       [0.5, 0.5, 0.5, 0.3, 0.4, 0.7, 0.4, 0.5, 0. , 0.3],
       [0.6, 0.6, 0.4, 0.4, 0.5, 0.4, 0.3, 0.6, 0.3, 0. ]])
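For reference, SciPy's "hamming" metric returns the fraction of positions that differ between two rows, so 1 minus the distance can be read as the share of matching genre flags. A minimal check, assuming two made-up 8-element vectors (a and b are illustrative, not from the question):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two made-up binary genre vectors (illustrative only)
a = np.array([[0, 1, 0, 0, 1, 1, 0, 1]])
b = np.array([[0, 1, 1, 0, 1, 0, 0, 1]])

# SciPy's hamming metric: proportion of positions that disagree
dist = cdist(a, b, metric="hamming")[0, 0]

# The same proportion computed by hand
manual = np.mean(a[0] != b[0])

# Similarity as the share of matching positions
similarity = 1.0 - dist
```

Here the vectors disagree in 2 of 8 positions, so the distance is 0.25 and the similarity is 0.75.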

You may wish to go one step further and define a function that returns the index of the row most similar to a given row:

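hamming_distance = cdist(x, x, metric="hamming")
# ignore each row's zero distance to itself
np.fill_diagonal(hamming_distance, np.inf)
# most similar = smallest distance, so use argmin rather than argmax
most_similar = lambda i: np.argmin(hamming_distance[i])
most_similar(2)
6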
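For the case in the question (one user against many titles), the same call can probably be applied with the user vector as a single-row array. The sketch below assumes the movies dataframe holds one row per title and one 0/1 column per genre, in the same order as the user vector; the random data, the title_* index and the genre_* column names are stand-ins, not the asker's data.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# User vector from the question (16 binary genre flags)
user = np.array([0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1])

# Hypothetical stand-in for the movies dataframe: one row per title,
# one binary column per genre (index and column names are made up)
rng = np.random.default_rng(0)
movies = pd.DataFrame(
    rng.integers(0, 2, size=(1000, user.size)),
    index=[f"title_{i}" for i in range(1000)],
    columns=[f"genre_{j}" for j in range(user.size)],
)

# Distance from the single user row to every movie row: shape (n_movies,)
dist = cdist(user.reshape(1, -1), movies.to_numpy(), metric="hamming")[0]

# Similarity = share of matching genre flags, ranked highest first
similarity = pd.Series(1.0 - dist, index=movies.index, name="similarity")
ranked = similarity.sort_values(ascending=False)
top10 = ranked.head(10)
```

This avoids an explicit Python loop over the rows, which should matter at 40,000 titles. Note that Hamming similarity also rewards genres that both the user and the movie lack; whether that is desirable depends on the use case.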

Upvotes: 0
