Regressor

Reputation: 1973

How to use paraphrase_mining using sentence transformers pre-trained model

I am trying to find the similarity between sentences using a pre-trained sentence-transformers model, following the code here - https://www.sbert.net/docs/usage/paraphrase_mining.html

In my first attempt I run two nested for-loops, comparing each sentence with every other sentence. Here is the code for that -

from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')


# Single list of sentences
sentences = ['The cat sits outside',
             'A man is playing guitar',
             'The new movie is awesome',
             'Do you like pizza?']

#Compute embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)

#Compute cosine-similarities for each sentence with each other sentence
cosine_scores = util.pytorch_cos_sim(embeddings, embeddings)

#Find the pairs with the highest cosine similarity scores
pairs = []
for i in range(len(cosine_scores)-1):
    for j in range(i+1, len(cosine_scores)):
        pairs.append({'index': [i, j], 'score': cosine_scores[i][j]})

#Sort scores in decreasing order
pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)

print(len(pairs))
6

for pair in pairs[0:10]:
    i, j = pair['index']
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences[i], sentences[j], pair['score']))

A man is playing guitar          Do you like pizza?          Score: 0.1080
The new movie is awesome         Do you like pizza?          Score: 0.0829
A man is playing guitar          The new movie is awesome        Score: 0.0652
The cat sits outside         Do you like pizza?          Score: 0.0523
The cat sits outside         The new movie is awesome        Score: -0.0270
The cat sits outside         A man is playing guitar         Score: -0.0530

This works as expected: 4 sentences yield 6 unique pairs of similarity scores. On their documentation page, they mention that this approach does not scale well because of its quadratic complexity, and they recommend using the paraphrase_mining() method instead.
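(For context, the pair count follows directly from combinatorics: n sentences produce n*(n-1)/2 unique pairs, which is why the brute-force approach becomes impractical for large lists. A quick sanity check:)

```python
from math import comb

# For n sentences, comparing every pair yields C(n, 2) = n*(n-1)/2 scores.
# Fine for 4 sentences, but it blows up quadratically for large lists.
for n in (4, 1_000, 100_000):
    print(f"{n} sentences -> {comb(n, 2)} pairs")
# 4 sentences -> 6 pairs
# 1000 sentences -> 499500 pairs
# 100000 sentences -> 4999950000 pairs
```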

But when I try that method, I get only 5 combinations instead of 6. Why is that the case?

Here is the sample code using the paraphrase_mining() method -

# Single list of sentences
sentences = ['The cat sits outside',
             'A man is playing guitar',
             'The new movie is awesome',
             'Do you like pizza?']


paraphrases = util.paraphrase_mining(model, sentences)
print(len(paraphrases))
5

k = 0
for paraphrase in paraphrases:
    print(k)
    score, i, j = paraphrase
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences[i], sentences[j], score))
    print()
    k = k + 1

0
A man is playing guitar          Do you like pizza?          Score: 0.1080

1
The new movie is awesome         Do you like pizza?          Score: 0.0829

2
A man is playing guitar          The new movie is awesome        Score: 0.0652

3
The cat sits outside         Do you like pizza?          Score: 0.0523

4
The cat sits outside         The new movie is awesome        Score: -0.0270

Is there a difference in how paraphrase_mining() computes the pairs?

Upvotes: 0

Views: 2858

Answers (1)

Yoko

Reputation: 76

Thanks for pointing this out.

There was a small bug in the paraphrase_mining function when the list of sentences is rather small. Instead of computing all combinations, it computed only n-1 combinations for each sentence. For a large list of sentences this is no issue, but for your specific example it dropped the least relevant combination and returned fewer pairs than intended.

It is fixed in the repository and will be part of the next release.
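(Until the fixed release is out, a workaround for small lists is the brute-force matrix from your first snippet. Here is a minimal sketch with NumPy; the toy 3-D vectors stand in for the output of model.encode(...), and all_pairs_cosine is a hypothetical helper, not part of the library:)

```python
import numpy as np
from itertools import combinations

def all_pairs_cosine(embeddings):
    """Return every (score, i, j) pair, sorted by descending cosine score."""
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize rows
    sims = emb @ emb.T                                      # cosine-similarity matrix
    return sorted(
        (float(sims[i, j]), i, j)
        for i, j in combinations(range(len(emb)), 2)
    )[::-1]

# Toy embeddings standing in for model.encode(sentences) output
toy = [[1.0, 0.0, 0.0],
       [0.9, 0.1, 0.0],
       [0.0, 1.0, 0.0],
       [0.0, 0.0, 1.0]]

for score, i, j in all_pairs_cosine(toy):
    print(i, j, round(score, 4))
```

With 4 embeddings this always produces all 6 pairs, matching the nested-loop version in the question.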

PS: You can also post questions on GitHub here: https://github.com/UKPLab/sentence-transformers/issues

There I get an email notification and can respond more quickly.

Upvotes: 1
