Reputation: 81
I have a dataset with more than 1500 rows, where each row contains a sentence. I am trying to find the best method to identify the most similar sentences among all of them. I have tried this example, but it was so slow that it took around 20 minutes for the 1500-row dataset.
I used the code from my previous question and tried many things to improve the speed, but nothing helped much. Then I came across the Universal Sentence Encoder from TensorFlow, which seems fast and reasonably accurate. I am working on Colab; you can check it here.
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" #@param ["https://tfhub.dev/google/universal-sentence-encoder/4", "https://tfhub.dev/google/universal-sentence-encoder-large/5", "https://tfhub.dev/google/universal-sentence-encoder-lite/2"]
model = hub.load(module_url)
print ("module %s loaded" % module_url)
def embed(input):
    return model(input)
df = pd.DataFrame(columns=["ID","DESCRIPTION"], data=np.matrix([[10,"Cancel ASN WMS Cancel ASN"],
[11,"MAXPREDO Validation is corect"],
[12,"Move to QC"],
[13,"Cancel ASN WMS Cancel ASN"],
[14,"MAXPREDO Validation is right"],
[15,"Verify files are sent every hours for this interface from Optima"],
[16,"MAXPREDO Validation are correct"],
[17,"Move to QC"],
[18,"Verify files are not sent"]
]))
messages = list(df["DESCRIPTION"])
message_embeddings = embed(messages)

for i, message_embedding in enumerate(np.array(message_embeddings).tolist()):
    print("Message: {}".format(messages[i]))
    print("Embedding size: {}".format(len(message_embedding)))
    message_embedding_snippet = ", ".join(
        (str(x) for x in message_embedding[:3]))
    print("Embedding: [{}, ...]\n".format(message_embedding_snippet))
What I am looking for
I want an approach where I can pass a threshold (for example 0.90), and all rows whose sentences are similar to each other above that threshold are returned as the result.
Data Sample
ID | DESCRIPTION
-----------------------------
10 | Cancel ASN WMS Cancel ASN
11 | MAXPREDO Validation is corect
12 | Move to QC
13 | Cancel ASN WMS Cancel ASN
14 | MAXPREDO Validation is right
15 | Verify files are sent every hours for this interface from Optima
16 | MAXPREDO Validation are correct
17 | Move to QC
18 | Verify files are not sent
Expected result
Rows of the above data that are similar above the 0.90 threshold should be returned together with their IDs:
ID | DESCRIPTION
-----------------------------
10 | Cancel ASN WMS Cancel ASN
13 | Cancel ASN WMS Cancel ASN
11 | MAXPREDO Validation is corect # even spelling is not correct
14 | MAXPREDO Validation is right
16 | MAXPREDO Validation are correct
12 | Move to QC
17 | Move to QC
Upvotes: 1
Views: 1565
Reputation: 3564
There are multiple ways in which you can find the similarity between two embedding vectors. The most common is cosine_similarity.
Therefore, the first thing you have to do is calculate the similarity matrix:
Code:
import sklearn.metrics.pairwise

message_embeddings = embed(list(df['DESCRIPTION']))
cos_sim = sklearn.metrics.pairwise.cosine_similarity(message_embeddings)
You get a 9x9 matrix of similarity values.
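As a quick sanity check (a minimal sketch built on the cos_sim computed above, not part of the original answer): the matrix is symmetric, every sentence has similarity 1.0 with itself on the diagonal, and identical descriptions such as IDs 10 and 13 should score very close to 1.0.
import numpy as np

print(cos_sim.shape)                     # (9, 9): one row/column per sentence
print(np.allclose(cos_sim, cos_sim.T))   # True: the matrix is symmetric
print(cos_sim[0][3])                     # rows with ID 10 and 13 have identical text, so this is ~1.0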
You can create a heatmap of this matrix to visualize it.
Code:
def plot_similarity(labels, corr_matrix):
    sns.set(font_scale=1.2)
    g = sns.heatmap(
        corr_matrix,
        xticklabels=labels,
        yticklabels=labels,
        vmin=0,
        vmax=1,
        cmap="YlOrRd")
    g.set_xticklabels(labels, rotation=90)
    g.set_title("Semantic Textual Similarity")

plot_similarity(list(df['DESCRIPTION']), cos_sim)
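If you run this as a plain script rather than in Colab/Jupyter, you may also need to create a figure and render it explicitly (a small usage sketch, not part of the original answer):
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))                        # optional: larger canvas so the long labels fit
plot_similarity(list(df['DESCRIPTION']), cos_sim)
plt.tight_layout()
plt.show()                                         # required to display the heatmap outside notebooks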
Output:
A darker box means higher similarity.
And finally, you iterate over this cos_sim matrix to get all the similar sentences using the threshold:
threshold = 0.8
row_index = []
for i in range(cos_sim.shape[0]):
    if i in row_index:
        continue
    similar = [index for index in range(cos_sim.shape[1]) if (cos_sim[i][index] > threshold)]
    if len(similar) > 1:
        row_index += similar

sim_df = pd.DataFrame()
sim_df['ID'] = [df['ID'][i] for i in row_index]
sim_df['DESCRIPTION'] = [df['DESCRIPTION'][i] for i in row_index]
sim_df
The data frame looks like this.
Output:
There are different methods with which you can generate the similarity matrix. You can take a look at this for more methods.
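For example, a plain inner product (a hedged sketch, not necessarily what the linked reference uses) gives the same values as cosine similarity once the embeddings are L2-normalized:
import numpy as np
from sklearn.preprocessing import normalize

emb = normalize(np.array(message_embeddings))  # L2-normalize each embedding (norm='l2' is the default)
inner_sim = np.inner(emb, emb)                 # inner product of unit vectors == cosine similarity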
Upvotes: 3