How to use paraphrase_mining using sentence transformers pre-trained model

Question

I am trying to find similarity between sentences using a pre-trained sentence-transformers model. I am trying to follow the code here - https://www.sbert.net/docs/usage/paraphrase_mining.html

In trial one I run 2 for-loops where in I try to find similarity of given sentence with every other sentence. Here is the code for that -

from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')


# Single list of sentences
sentences = ['The cat sits outside',
             'A man is playing guitar',
             'The new movie is awesome',
             'Do you like pizza?']

#Compute embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)

#Compute cosine-similarities for each sentence with each other sentence
cosine_scores = util.pytorch_cos_sim(embeddings, embeddings)

#Find the pairs with the highest cosine similarity scores
pairs = []
for i in range(len(cosine_scores)-1):
    for j in range(i+1, len(cosine_scores)):
        pairs.append({'index': [i, j], 'score': cosine_scores[i][j]})

#Sort scores in decreasing order
pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)

print(len(pairs))
6

for pair in pairs[0:10]:
    i, j = pair['index']
    print("{} 		 {} 		 Score: {:.4f}".format(sentences[i], sentences[j], pair['score']))

A man is playing guitar          Do you like pizza?          Score: 0.1080
The new movie is awesome         Do you like pizza?          Score: 0.0829
A man is playing guitar          The new movie is awesome        Score: 0.0652
The cat sits outside         Do you like pizza?          Score: 0.0523
The cat sits outside         The new movie is awesome        Score: -0.0270
The cat sits outside         A man is playing guitar         Score: -0.0530

This works as expected because there can be 6 combinations of the similarity scores between the combinations of 4 sentences. On their documentation page, they mention that this does not scale well because of quadratic complexity and hence they recommend using paraphrase_mining() method.

But when I try to use that method, I do not get 6 combinations but instead only get 5. Why is that the case?

Here is the sample code that I try using the paraphrase_mining() method -

# Single list of sentences
sentences = ['The cat sits outside',
             'A man is playing guitar',
             'The new movie is awesome',
             'Do you like pizza?']


paraphrases = util.paraphrase_mining(model, sentences)
print(len(paraphrases))
5

k = 0
for paraphrase in paraphrases:
    print(k)
    score, i, j = paraphrase
    print("{} 		 {} 		 Score: {:.4f}".format(sentences[i], sentences[j], score))
    print()
    k = k + 1

0
A man is playing guitar          Do you like pizza?          Score: 0.1080

1
The new movie is awesome         Do you like pizza?          Score: 0.0829

2
A man is playing guitar          The new movie is awesome        Score: 0.0652

3
The cat sits outside         Do you like pizza?          Score: 0.0523

4
The cat sits outside         The new movie is awesome        Score: -0.0270

Is there a difference between how the paraphrase_mining() works?

How to use paraphrase_mining using sentence transformers pre-trained model

Answers (1)

Related Questions