Reputation: 2075
How do I get an embedding for a whole sentence from Hugging Face's feature-extraction pipeline?
I understand how to get the features for each token (below), but how do I get the overall features for the sentence as a whole?
from transformers import pipeline

feature_extraction = pipeline('feature-extraction', model="distilroberta-base", tokenizer="distilroberta-base")
features = feature_extraction("i am sentence")
Upvotes: 4
Views: 15899
Reputation: 16587
As mentioned in the previous answers, different strategies exist. Here are two:
from transformers import pipeline
feature_extractor = pipeline('feature-extraction', model="BAAI/bge-small-en-v1.5",
                             model_kwargs={'cache_dir': 'cache/hf'})
text = "Hello world example!"

# Strategy 1: average pooling over all token embeddings
print(feature_extractor(text, return_tensors="pt")[0].numpy().mean(axis=0).shape)

# Strategy 2: take the [CLS] token embedding (the first token)
print(feature_extractor(text, return_tensors="pt")[0].numpy()[0].shape)
Read more about the feature-extraction pipeline in the Transformers documentation.
Upvotes: 0
Reputation: 1185
If you want a meaningful embedding of a whole sentence, use SentenceTransformers. Pooling is well implemented there, and the library also provides APIs for fine-tuning models to produce features/embeddings at the sentence/text-chunk level.
pip install sentence-transformers
Once you have installed sentence-transformers, the code below can be used to produce sentence embeddings:
from sentence_transformers import SentenceTransformer
model_st = SentenceTransformer('distilroberta-base')
embeddings = model_st.encode('I am a sentence')
print(embeddings)
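These embeddings can then be compared directly. As a minimal sketch, here is cosine similarity between two sentences using the library's util helpers (the second sentence is just an illustrative example):

from sentence_transformers import SentenceTransformer, util

model_st = SentenceTransformer('distilroberta-base')
emb1 = model_st.encode('I am a sentence', convert_to_tensor=True)
emb2 = model_st.encode('I am another sentence', convert_to_tensor=True)

# cosine similarity between the two sentence embeddings
print(util.cos_sim(emb1, emb2))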
Visit the official site for more info on Sentence Transformers.
Upvotes: 4
Reputation: 11430
To expand on the comment I left under stackoverflowuser2010's answer, I will use "barebone" models, but the behavior is the same with the pipeline component.
BERT and derived models (including DistilRoberta, which is the model you are using in the pipeline) generally indicate the start and end of a sentence with special tokens (mostly denoted as [CLS] for the first token), which are usually the easiest way of making predictions/generating embeddings over the entire sequence. There is a discussion within the community about which method is superior (see also a more detailed answer by stackoverflowuser2010 here); however, if you simply want a "quick" solution, then taking the [CLS] token is certainly a valid strategy.
Now, while the documentation of the FeatureExtractionPipeline isn't very clear, in your example we can easily compare the outputs, specifically their lengths, with a direct model call:
from transformers import pipeline, AutoTokenizer
# direct encoding of the sample sentence
tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')
encoded_seq = tokenizer.encode("i am sentence")
# your approach
feature_extraction = pipeline('feature-extraction', model="distilroberta-base", tokenizer="distilroberta-base")
features = feature_extraction("i am sentence")
# Compare lengths of outputs
print(len(encoded_seq)) # 5
# Note that the pipeline output is wrapped in an extra list, so it must be indexed with 0.
print(len(features[0])) # 5
When inspecting the content of encoded_seq, you will notice that the first token has index 0, denoting the beginning-of-sequence token (<s>, RoBERTa's equivalent of [CLS]). Since the output lengths are the same, you can then access a preliminary sentence embedding by doing something like
sentence_embedding = features[0][0]
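For comparison, here is a minimal sketch of the same extraction with a direct ("barebone") model call rather than the pipeline (assuming PyTorch is installed):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')
model = AutoModel.from_pretrained('distilroberta-base')

inputs = tokenizer("i am sentence", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, seq_len, hidden_dim); the first token
# is the beginning-of-sequence token, corresponding to features[0][0] above
sentence_embedding = outputs.last_hidden_state[0, 0]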
Upvotes: 7
Reputation: 40909
If you have the embeddings for each token, you can create an overall sentence embedding by pooling (summarizing) over them. Note that if you have D-dimensional token embeddings, you will get a D-dimensional sentence embedding through either of these approaches (a sketch follows the list below):
Compute the mean over all token embeddings.
Compute the max of each of the D dimensions over all the token embeddings.
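For illustration, a minimal sketch of both strategies over a (seq_len, D) array of token embeddings (dummy data here; in practice this would be, e.g., features[0] from the pipeline above):

import numpy as np

# dummy token embeddings: 5 tokens, D = 768 dimensions
token_embeddings = np.random.rand(5, 768)

sentence_mean = token_embeddings.mean(axis=0)  # mean pooling -> shape (768,)
sentence_max = token_embeddings.max(axis=0)    # per-dimension max -> shape (768,)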
Upvotes: 0