Reputation: 31
I have already installed BERT, but I don't know how to get non-contextual word embeddings.
For example:
input: 'Apple'
output: [1,2,23,2,13,...] #embedding of 'Apple'
How can I get these word embeddings?
Thank you.
I searched for a method, but no blog posts describe how to do this.
Upvotes: 0
Views: 834
Reputation: 31
Solved.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# get the word embedding from BERT
def get_word_embedding(word: str):
    input_ids = torch.tensor(tokenizer.encode(word)).unsqueeze(0)  # batch size 1: [CLS] word [SEP]
    outputs = model(input_ids)
    last_hidden_states = outputs[0]  # the last hidden states are the first element of the output tuple
    # outputs[0] is the sequence of token vectors (last hidden states)
    # outputs[1] is the pooler output (the [CLS] vector passed through a linear layer and tanh)
    return last_hidden_states[0][1]  # vector of the first subword after [CLS]
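For example (a quick check, assuming bert-base-uncased, whose hidden size is 768):

# usage example: the function returns a 768-dimensional vector for bert-base-uncased
embedding = get_word_embedding("apple")
print(embedding.shape)  # torch.Size([768])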
Upvotes: 1
Reputation: 11220
BERT uses static subword embeddings in its first layer, where they get summed with learned position embeddings. You can get the embedding layer by calling model.embeddings.word_embeddings. You should be able to pass the indices that you get from a BertTokenizer to this layer and get the subword embeddings.
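A minimal sketch of that lookup (assuming bert-base-uncased and the Hugging Face transformers API; the word "apple" is just an example):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# subword indices without the special [CLS]/[SEP] tokens
input_ids = tokenizer.encode("apple", add_special_tokens=False, return_tensors="pt")

# look the indices up in the static first-layer embedding matrix
with torch.no_grad():
    static_embeddings = model.embeddings.word_embeddings(input_ids)

print(static_embeddings.shape)  # (1, number_of_subwords, 768)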
There are, however, several caveats with static embeddings: these are not word embeddings but embeddings of the subwords that BERT uses internally (less frequent words get segmented into smaller units). The embeddings are also of much worse quality than standard word embeddings (Word2Vec, FastText) because they are trained to be combined with position embeddings and to serve the later layers, not to act as standalone embeddings.
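To see the subword segmentation this refers to (a small sketch; the exact split depends on the vocabulary of the checkpoint):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("apple"))       # frequent word, a single subword: ['apple']
print(tokenizer.tokenize("embeddings"))  # rarer word, several pieces, e.g. ['em', '##bed', '##ding', '##s']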
There are also methods for getting high-quality word embeddings from BERT (and similar models). Those require training data and some computation. AFAIK the best methods are:
Interpreting Pretrained Contextualized Representations via Reductions to Static Embeddings (Bommasani et al., ACL 2020).
Obtaining Better Static Word Embeddings Using Contextual Embedding Models (Gupta & Jaggi, ACL 2021), with code on GitHub.
Upvotes: 1