Reputation: 11
I need to code a similarity score in Python to find matches based on movie genre.
The comparison is for one user: I want the similarity between that user's binary genre scores and a dataframe of binary genre scores for 40,000 movie titles. I need to compare each row of the dataframe with the user's scores.
For instance, take user 1 with score [0,1,0,0,0,0,1,0,0,0,1,1,0,0,0,1]
and compare it to the Movies dataframe: [image: Movies Dataframe]
I would like to compute a similarity score between the user and each title, so that I can rank the titles most similar to the user's preferences.
I have read that Hamming distance is probably the best method for binary values. How can I implement this? Thanks
Reputation: 25199
Try:
import numpy as np
from scipy.spatial.distance import cdist
# sample data: 10 rows of 10 random binary values
x = np.random.randint(0,2,100).reshape(10,10)
# pairwise Hamming distance between every pair of rows
cdist(x,x, metric="hamming")
array([[0. , 0.6, 0.8, 0.6, 0.3, 0.4, 0.7, 0.4, 0.5, 0.6],
[0.6, 0. , 0.4, 0.6, 0.7, 0.4, 0.3, 0.6, 0.5, 0.6],
[0.8, 0.4, 0. , 0.4, 0.7, 0.4, 0.3, 0.8, 0.5, 0.4],
[0.6, 0.6, 0.4, 0. , 0.3, 0.6, 0.5, 0.6, 0.3, 0.4],
[0.3, 0.7, 0.7, 0.3, 0. , 0.5, 0.6, 0.5, 0.4, 0.5],
[0.4, 0.4, 0.4, 0.6, 0.5, 0. , 0.5, 0.6, 0.7, 0.4],
[0.7, 0.3, 0.3, 0.5, 0.6, 0.5, 0. , 0.5, 0.4, 0.3],
[0.4, 0.6, 0.8, 0.6, 0.5, 0.6, 0.5, 0. , 0.5, 0.6],
[0.5, 0.5, 0.5, 0.3, 0.4, 0.7, 0.4, 0.5, 0. , 0.3],
[0.6, 0.6, 0.4, 0.4, 0.5, 0.4, 0.3, 0.6, 0.3, 0. ]])
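For reference, the values in the matrix above are fractions rather than counts: SciPy's "hamming" metric returns the proportion of positions at which two vectors disagree. A minimal sketch that checks this by hand, using two made-up vectors u and v (not taken from the question):

```python
import numpy as np
from scipy.spatial.distance import hamming

# Two illustrative binary vectors (hypothetical data)
u = np.array([0, 1, 0, 0, 1])
v = np.array([0, 1, 1, 0, 0])

# Fraction of positions where the vectors disagree (2 of 5 here)
manual = np.mean(u != v)
print(manual, hamming(u, v))

# Multiply by the vector length to recover the raw mismatch count
mismatches = int(round(manual * len(u)))
print(mismatches)
```

If a similarity rather than a distance is wanted, 1 - distance gives the fraction of matching positions.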
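You may wish to go one step further and define a function that returns the index of the row most similar to a given row. The most similar row is the one with the smallest Hamming distance, and the diagonal has to be masked out so that a row is not matched with itself:
hamming_distance = cdist(x,x, metric="hamming")
# ignore self-matches (distance 0 on the diagonal)
np.fill_diagonal(hamming_distance, np.inf)
# smallest distance = most similar row
most_similar = lambda i: np.argmin(hamming_distance[i])
most_similar(2)
6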
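To map this onto the setup in the question (one user vector against a dataframe of titles), something along these lines may work. This is only a sketch: the dataframe below is a randomly generated stand-in, and the column names (title, genre_0 ... genre_15) are assumptions, since the real Movies dataframe is not shown.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Hypothetical stand-in for the real movies dataframe:
# one row per title, one binary column per genre.
rng = np.random.default_rng(0)
genre_cols = [f"genre_{k}" for k in range(16)]
movies = pd.DataFrame(rng.integers(0, 2, size=(1000, 16)), columns=genre_cols)
movies.insert(0, "title", [f"movie_{k}" for k in range(len(movies))])

# User vector from the question (16 binary genre flags)
user = np.array([0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1])

# cdist expects 2-D inputs, so the user vector is reshaped to (1, 16);
# the result has shape (1, n_movies), hence the [0]
dist = cdist(user.reshape(1, -1), movies[genre_cols].to_numpy(), metric="hamming")[0]

# Turn distance into similarity and rank titles, most similar first
ranked = movies.assign(similarity=1.0 - dist).sort_values("similarity", ascending=False)
print(ranked[["title", "similarity"]].head(10))
```

This avoids an explicit Python loop over the rows, which should matter at 40,000 titles. Note that Hamming similarity counts a shared 0 as a match just like a shared 1, so titles with few genres can score high against a user with few preferences.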