Reputation: 519
I have a CSV file that looks like this:
idx messages
112 I have a car and it is blue
114 I have a bike and it is red
115 I don't have any car
117 I don't have any bike
I would like code that reads the file and computes the pairwise similarity between the messages.
I have looked at many posts on this (such as 1, 2, 3, 4), but they are either hard for me to understand or not exactly what I want.
Some posts and webpages say that "a simple and effective one is Cosine similarity", while others suggest "Universal sentence encoder" or "Levenshtein distance".
It would be great if you could help with code that I can run on my side as well. Thanks.
Upvotes: 0
Views: 361
Reputation: 59549
I don't know that calculations like this can be vectorized particularly well, so a plain loop is the simple option. At the very least, use the fact that the calculation is symmetric and the diagonal is always 100 to cut down on the number of comparisons you perform.
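import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz

# Assumption: the CSV is comma-separated with columns 'idx' and 'messages';
# 'messages.csv' is a placeholder file name.
df = pd.read_csv('messages.csv')

K = len(df)
similarity = np.empty((K, K), dtype=float)

for i, ac in enumerate(df['messages']):
    for j, bc in enumerate(df['messages']):
        if i > j:
            continue  # the lower triangle is filled by symmetry below
        if i == j:
            sim = 100  # a string is always identical to itself
        else:
            # Use whatever metric you want here for comparison of 2 strings.
            sim = fuzz.ratio(ac, bc)
        similarity[i, j] = sim
        similarity[j, i] = sim

df_sim = pd.DataFrame(similarity, index=df.idx, columns=df.idx)
df_sim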
idx    112    114    115    117
idx
112  100.0   78.0   51.0   50.0
114   78.0  100.0   47.0   54.0
115   51.0   47.0  100.0   83.0
117   50.0   54.0   83.0  100.0
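Since the question also mentions cosine similarity, here is a rough sketch of how a different metric could be dropped into the same symmetric loop using only the standard library. The names `cosine_similarity` and `pairwise_similarity` are made up for illustration, and the lowercase whitespace tokenization is an assumption on my part, not something the question specifies. It scores word overlap on a 0-100 scale, so the numbers will not match `fuzz.ratio`.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity (scaled to 0-100) between word-count vectors."""
    # Assumption: lowercase + whitespace split is an adequate tokenizer here.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return 100.0 * dot / norm if norm else 0.0


def pairwise_similarity(messages, metric):
    """Build a symmetric K x K matrix, calling `metric` once per unordered pair."""
    k = len(messages)
    # The diagonal is 100 by definition: each message matches itself.
    matrix = [[100.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for i, j in combinations(range(k), 2):  # only pairs with i < j
        matrix[i][j] = matrix[j][i] = metric(messages[i], messages[j])
    return matrix


messages = [
    "I have a car and it is blue",
    "I have a bike and it is red",
    "I don't have any car",
    "I don't have any bike",
]
sim = pairwise_similarity(messages, cosine_similarity)
for row in sim:
    print([round(v, 1) for v in row])
```

`itertools.combinations` visits each unordered pair exactly once, which is the same saving as the `if i > j: continue` check above. Any function that takes two strings and returns a number can be passed as `metric`, including `fuzz.ratio`.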
Upvotes: 1