Reputation: 1973
I am trying to find the similarity between sentences using a pre-trained sentence-transformers model, following the code here - https://www.sbert.net/docs/usage/paraphrase_mining.html
In my first attempt I run two for-loops, in which I find the similarity of each sentence with every other sentence. Here is the code for that -
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

# Single list of sentences
sentences = ['The cat sits outside',
             'A man is playing guitar',
             'The new movie is awesome',
             'Do you like pizza?']

# Compute embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)

# Compute cosine similarities for each sentence with each other sentence
cosine_scores = util.pytorch_cos_sim(embeddings, embeddings)

# Find the pairs with the highest cosine similarity scores
pairs = []
for i in range(len(cosine_scores)-1):
    for j in range(i+1, len(cosine_scores)):
        pairs.append({'index': [i, j], 'score': cosine_scores[i][j]})

# Sort scores in decreasing order
pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)

print(len(pairs))
6

for pair in pairs[0:10]:
    i, j = pair['index']
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences[i], sentences[j], pair['score']))
A man is playing guitar Do you like pizza? Score: 0.1080
The new movie is awesome Do you like pizza? Score: 0.0829
A man is playing guitar The new movie is awesome Score: 0.0652
The cat sits outside Do you like pizza? Score: 0.0523
The cat sits outside The new movie is awesome Score: -0.0270
The cat sits outside A man is playing guitar Score: -0.0530
This works as expected, because there are 6 possible pairs among 4 sentences. On their documentation page, they mention that this approach does not scale well due to its quadratic complexity, and they therefore recommend the paraphrase_mining() method instead.
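As a sanity check, the expected pair count is just the number of unordered 2-combinations of the sentences, C(4, 2) = 6. A quick standalone sketch (plain Python, no sentence-transformers needed):

```python
from itertools import combinations

sentences = ['The cat sits outside',
             'A man is playing guitar',
             'The new movie is awesome',
             'Do you like pizza?']

# Every unordered pair of distinct sentence indices: C(4, 2) = 6
pairs = list(combinations(range(len(sentences)), 2))
print(len(pairs))  # 6
```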
But when I try that method, I get only 5 combinations instead of 6. Why is that the case?
Here is the sample code using the paraphrase_mining() method -
# Single list of sentences
sentences = ['The cat sits outside',
             'A man is playing guitar',
             'The new movie is awesome',
             'Do you like pizza?']

paraphrases = util.paraphrase_mining(model, sentences)
print(len(paraphrases))
5

k = 0
for paraphrase in paraphrases:
    print(k)
    score, i, j = paraphrase
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences[i], sentences[j], score))
    print()
    k = k + 1
0
A man is playing guitar Do you like pizza? Score: 0.1080
1
The new movie is awesome Do you like pizza? Score: 0.0829
2
A man is playing guitar The new movie is awesome Score: 0.0652
3
The cat sits outside Do you like pizza? Score: 0.0523
4
The cat sits outside The new movie is awesome Score: -0.0270
Is there a difference in how paraphrase_mining() works?
Upvotes: 0
Views: 2858
Reputation: 76
Thanks for pointing this out.
There was a small bug in the paraphrase_mining function when the list of sentences is rather small. Instead of computing all combinations, it computed only the top n-1 combinations for each sentence. For a large list of sentences this is not an issue, but in your specific example it dropped the least relevant combination and returned fewer pairs than intended.
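To illustrate the effect (this is a standalone sketch, not the library's actual code): if each sentence keeps only its top-k neighbours before the per-sentence lists are merged, the globally weakest pair can vanish from the union. Using the cosine scores from your own output:

```python
# Cosine scores from the question's output, keyed by sentence-index pair
scores = {
    (0, 1): -0.0530, (0, 2): -0.0270, (0, 3): 0.0523,
    (1, 2):  0.0652, (1, 3):  0.1080, (2, 3): 0.0829,
}

def mined_pairs(k):
    """Union of each sentence's top-k highest-scoring neighbour pairs."""
    kept = set()
    for s in range(4):
        # pairs involving sentence s, best scores first, truncated to top-k
        neigh = sorted((p for p in scores if s in p),
                       key=lambda p: scores[p], reverse=True)[:k]
        kept.update(neigh)
    return kept

print(len(mined_pairs(3)))  # 6: with k = n-1 every pair survives
print(len(mined_pairs(2)))  # 5: (0, 1), the weakest pair, is dropped
```

With the smaller cutoff, the pair ('The cat sits outside', 'A man is playing guitar') is never in any sentence's top list, which matches the 5 pairs you observed.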
It is fixed in the repository and will be part of the next release.
PS: You can also post your questions on GitHub here: https://github.com/UKPLab/sentence-transformers/issues
There I get an email notification and can respond more quickly.
Upvotes: 1