user2543622

Reputation: 6766

sentence transformer how to predict new example

I am exploring sentence transformers and came across this page. It shows how to train on custom data, but I am not sure how to predict. If there are two new sentences such as 1) "this is the third example", 2) "this is the example number three", how could I get a prediction of how similar those sentences are?

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

#Define the model. Either from scratch or by loading a pre-trained model
model = SentenceTransformer('distilbert-base-nli-mean-tokens')

#Define your train examples. You need more than just two examples...
train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
    InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]

#Define your train dataset, the dataloader and the train loss
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

#Tune the model
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)

----------------------------update 1

I updated the code as below:

from datetime import datetime

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

#Define the model. Either from scratch or by loading a pre-trained model
model = SentenceTransformer('distilbert-base-nli-mean-tokens')

#Define your train examples. You need more than just two examples...
train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
    InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]

#Define your train dataset, the dataloader and the train loss
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

I saved the model, which is the main change compared to the old code:

model_save_path2 = '/content/gdrive/MyDrive/folderName1/folderName2/model_try-'+datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

#Tune the model and save it too
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100,output_path=model_save_path2)

I am not sure about the steps below:

#loading the new model
model_new = SentenceTransformer(model_save_path2)

#predicting
sentences = ["This is an example sentence", "Each sentence is converted"]
model_new.encode(sentences)

question 1)

Is this the correct approach to get sentence embeddings after training the old model and loading it as a new model? I am confused because during the fitting process we fed in two sentences along with a similarity label, while at prediction time we are inputting one sentence at a time and getting a sentence embedding for each sentence.

question 2)

If I would like to get a similarity score for two sentences, is the only option to take the sentence embeddings output by this model and then compute cosine similarity between them?

Upvotes: 5

Views: 3536

Answers (1)

James Briggs

Reputation: 914

Q1) Sentence transformers create sentence embeddings/vectors: you give the model a sentence and it outputs a numerical representation (e.g. a vector) of that sentence. The reason you feed in two sentences at a time during training is that the model is being optimized to output similar or dissimilar vectors for similar or dissimilar sentence pairs.

During training the model still processes one sentence at a time: you feed in one sentence after the other, producing the two embeddings. The cosine similarity between the two embeddings is then calculated, and the loss is based on the difference between the predicted similarity (output by the cosine similarity function) and the true similarity (from the label field of your data).

So during training, that final step of calculating cosine similarity is included because you are optimizing on CosineSimilarityLoss.
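
For illustration, here is a minimal sketch of that idea; the variable names are mine, and using mean-squared error as the regression loss is an assumption, not something stated above:

import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('distilbert-base-nli-mean-tokens')

# Each sentence is encoded on its own, producing one embedding per sentence
emb_a = model.encode('My first sentence', convert_to_tensor=True)
emb_b = model.encode('My second sentence', convert_to_tensor=True)

# The predicted similarity is the cosine similarity of the two embeddings
predicted_sim = torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=0)

# The loss compares the predicted similarity with the gold label (0.8 in the training data above)
label = torch.tensor(0.8)
loss = torch.nn.functional.mse_loss(predicted_sim, label)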

Q2) When using a sentence transformer, that is the correct process. You could alternatively use a cross encoder model, which outputs a similarity score directly; however, this negates the advantage of sentence encoders, namely that you can create the sentence embeddings and then store them in a vector database for later use.
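
As a rough sketch of the cross-encoder route (the checkpoint name below is just an example STS model, not a specific recommendation):

from sentence_transformers import CrossEncoder

# A cross encoder reads the sentence pair jointly and outputs a similarity score directly
cross_model = CrossEncoder('cross-encoder/stsb-roberta-base')
scores = cross_model.predict([
    ('this is the third example', 'this is the example number three'),
])
print(scores)  # one score per input pair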

If you then need to calculate the similarity between a new sentence and thousands of previously encoded sentences, with the sentence embeddings you just compute the cosine similarity between all pairs. With a cross encoder you would need to feed each pair into the model and run a full BERT inference step (if using a BERT cross encoder), which takes much longer than a cosine similarity computation.
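
And a minimal sketch of that bi-encoder workflow for your two example sentences plus a stored corpus (the model path and corpus are placeholders; util.cos_sim is the cosine-similarity helper shipped with sentence-transformers, called util.pytorch_cos_sim in older versions):

from sentence_transformers import SentenceTransformer, util

# Load the fine-tuned model saved during training (placeholder path)
model_new = SentenceTransformer('path/to/your/saved/model')

# Encode previously seen sentences once; these embeddings can be stored and reused
corpus = ['this is the third example', 'another stored sentence', 'yet another one']
corpus_embeddings = model_new.encode(corpus, convert_to_tensor=True)

# Encode the new sentence and score it against every stored embedding in one call
query_embedding = model_new.encode('this is the example number three', convert_to_tensor=True)
scores = util.cos_sim(query_embedding, corpus_embeddings)  # shape (1, len(corpus))
print(scores)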

Upvotes: 7
