edamame

Reputation: 31

How to get Non-contextual Word Embeddings in BERT?

I have already installed BERT, but I don't know how to get non-contextual word embeddings.

For example:


input: 'Apple'
output: [1,2,23,2,13,...] #embedding of 'Apple'


How can I get these word embeddings?

Thank you.

I have searched for a method, but I haven't found any blog post that explains how to do it.

Upvotes: 0

Views: 834

Answers (2)

edamame

Reputation: 31

Solved.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# get the word embedding from BERT
def get_word_embedding(word: str):
    # tokenizer.encode adds the special [CLS] and [SEP] tokens around the word
    input_ids = torch.tensor(tokenizer.encode(word)).unsqueeze(0)  # batch size 1
    outputs = model(input_ids)
    last_hidden_states = outputs[0]  # outputs[0] is the last hidden state, one vector per token
    # outputs[1] is the pooler output (the [CLS] vector passed through a dense layer and tanh),
    # not a mean pooling of the hidden states
    # return the vector of the first token after [CLS]; this assumes the word is a single
    # subword in the vocabulary, otherwise only its first piece is returned
    return last_hidden_states[0][1]
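For example (a minimal usage sketch; the shape holds for bert-base-uncased, whose hidden size is 768):

vector = get_word_embedding("apple")
print(vector.shape)  # torch.Size([768])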


Upvotes: 1

Jindřich

Reputation: 11220

BERT uses static subword embeddings in its first layer, where they are summed with learned position embeddings. You can access the embedding layer as model.embeddings.word_embeddings. You should be able to pass the indices you get from a BertTokenizer to this layer and get the subword embeddings.
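A minimal sketch of that lookup (assuming bert-base-uncased; if the word is split into several subwords, you get one vector per piece):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def get_static_embedding(word: str):
    # convert the word to subword ids without adding [CLS]/[SEP]
    ids = tokenizer.encode(word, add_special_tokens=False)
    # look the ids up directly in the input embedding matrix (no contextualization)
    with torch.no_grad():
        vectors = model.embeddings.word_embeddings(torch.tensor(ids))
    return vectors  # one row per subword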

There are, however, several caveats with these static embeddings: they are not embeddings of words but of the subwords that BERT uses internally (less frequent words get segmented into smaller units). They are also of much worse quality than standard word embeddings (Word2Vec, FastText) because they are trained to be combined with the position embeddings and to serve the later layers, not to act as standalone embeddings.
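To illustrate the segmentation (a minimal sketch with bert-base-uncased; the exact splits depend on the vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("apple"))       # a frequent word stays whole: ['apple']
print(tokenizer.tokenize("embeddings"))  # a rarer word is split into pieces, e.g. ['em', '##bed', '##ding', '##s']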

There are also methods for getting high-quality word embeddings from BERT (and similar models). Those require training data and some computation. AFAIK the best methods are:

Upvotes: 1
