raviTeja

Reputation: 358

How to get word embeddings from pretrained transformers

I am working on a word-level classification task on multilingual data using XLM-R. I know that XLM-R uses SentencePiece as its tokenizer, which sometimes splits words into subwords.

For example, the sentence "deception master" is tokenized as de ception master: the word deception has been split into two sub-words.

How can I get the embedding of deception? I could take the mean of the subwords to get the embedding of the word, as done here. But I have to implement my code in TensorFlow, and TensorFlow's computational graph doesn't support NumPy.

I could store the final hidden embeddings, after taking the mean of the subwords, in a NumPy array and feed this array to the model, but I want to fine-tune the transformer.
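To make it concrete, this is roughly the operation I mean (just a toy sketch with random vectors and made-up names, not my actual code):

import numpy as np

# The word "deception" is split into the subwords "de" and "ception".
de = np.random.rand(768)       # final hidden state of the subword "de"
ception = np.random.rand(768)  # final hidden state of the subword "ception"

# Word-level embedding obtained by averaging the two subword embeddings.
deception = (de + ception) / 2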

How can I get the word embeddings from the sub-word embeddings given by the transformer?

Upvotes: 0

Views: 1968

Answers (1)

Jindřich

Reputation: 11240

Joining subword embeddings into words for word labeling is not how this problem is usually approached. The usual approach is the opposite: keep the subwords as they are, but adjust the labels to respect the tokenization of the pre-trained model.

One of the reasons is that the data is typically processed in batches. When merging subwords into words, every sentence in the batch would end up with a different length, which would require processing each sentence independently and padding the batch again – this would be slow. Also, if you do not average the neighboring embeddings, you get more fine-grained information from the loss function, which tells you explicitly which subword is responsible for an error.

When tokenizing using SentencePiece, you can get the indices in the original string:

from transformers import XLMRobertaTokenizerFast
tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-base")
tokenizer("deception master", return_offsets_mapping=True)

This returns the following dictionary:

{'input_ids': [0, 8, 63928, 31347, 2],
 'attention_mask': [1, 1, 1, 1, 1],
 'offset_mapping': [(0, 0), (0, 2), (2, 9), (10, 16), (0, 0)]}

With the offsets, you can find out whether a subword corresponds to a word that you want to label. There are various strategies for encoding the labels. The easiest one is just to copy the label to every subword, as sketched below. A fancier way would be to use schemes from named entity recognition, such as IOB tagging, which explicitly marks the beginning of a labeled segment.
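Here is a minimal sketch of the copy-the-label strategy. Instead of matching the offsets by hand, it uses the fast tokenizer's word_ids() helper, which gives the word index of every subword; the word_labels list is made up for the example:

from transformers import XLMRobertaTokenizerFast

tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-base")

words = ["deception", "master"]
word_labels = [1, 0]  # hypothetical word-level labels

# Tokenize pre-split words so the tokenizer can map every subword back to its word.
encoding = tokenizer(words, is_split_into_words=True)

subword_labels = []
for word_id in encoding.word_ids():
    if word_id is None:
        subword_labels.append(-100)  # special tokens such as <s> and </s>, masked out of the loss
    else:
        subword_labels.append(word_labels[word_id])  # copy the word label to every subword

print(encoding.tokens())   # e.g. ['<s>', '▁de', 'ception', '▁master', '</s>']
print(subword_labels)      # e.g. [-100, 1, 1, 0, -100]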

Upvotes: 1
