user799188

Reputation: 14415

Transformers PreTrainedTokenizer add_tokens Functionality

Referring to the documentation of the awesome Transformers library from Huggingface, I came across the add_tokens function.

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
num_added_toks = tokenizer.add_tokens(['new_tok1', 'my_new-tok2'])
model.resize_token_embeddings(len(tokenizer))

I tried the above by adding words that were previously absent from the default vocabulary. However, keeping everything else constant, I noticed a decrease in the accuracy of the fine-tuned classifier that uses this updated tokenizer. I was able to replicate similar behavior even when just 10% of the previously absent words were added.

My questions

  1. Am I missing something?
  2. Instead of whole words, is the add_tokens function expecting masked tokens, for example: '##ah', '##red', '##ik', '##si', etc.? If yes, is there a procedure to generate such masked tokens?

Any help would be appreciated.

Thanks in advance.

Upvotes: 3

Views: 3096

Answers (1)

Jindřich

Reputation: 11220

If you add tokens to the tokenizer, you indeed make the tokenizer tokenize the text differently, but this is not the tokenization BERT was trained with, so you are basically adding noise to the input. The word embeddings of the new tokens are not trained and the rest of the network has never seen them in context. You would need a lot of data to teach BERT to deal with the newly added words.
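
For illustration, here is a minimal sketch (reusing the bert-base-uncased checkpoint from the question; the variable names are just for this example) showing that the embedding rows created by resize_token_embeddings for the added tokens are freshly initialized, i.e. untrained:

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

old_vocab_size = len(tokenizer)
tokenizer.add_tokens(['new_tok1'])
model.resize_token_embeddings(len(tokenizer))

# The row for the added token is newly (randomly) initialized; the rest of
# the network never saw it during pre-training.
embeddings = model.get_input_embeddings().weight
print(embeddings[old_vocab_size])  # untrained vector for 'new_tok1'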

There are also some ways to compute an embedding for a single new word such that it does not hurt BERT, like in this paper, but it seems pretty complicated and should not make any difference.

BERT uses a word-piece-based vocabulary, so it should not really matter if the words are present in the vocabulary as a single token or get split into multiple wordpieces. The model probably saw the split word during pre-training and will know what to do with it.
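
As a quick sanity check (just a sketch; the example sentence is arbitrary and the exact wordpiece split depends on the vocabulary), a sentence containing such a word goes through BERT as usual:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# The unknown word is simply split into known wordpieces; BERT processes them
# like any other input and returns one hidden state per wordpiece.
inputs = tokenizer("A walrus sleeps on the ice.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, number of wordpieces incl. special tokens, 768)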

Regarding the ##-prefixed tokens: those are tokens that can only be used as a continuation (suffix) of another wordpiece. E.g., walrus gets split into ['wal', '##rus'] and you need both of these wordpieces to be in the vocabulary, but not ##wal or rus.
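
For example (the printed split is the one described above; the exact pieces depend on the actual vocabulary, so treat the output as illustrative):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# 'walrus' is split into a leading piece plus a '##'-prefixed continuation piece;
# both pieces have to be vocabulary entries, '##wal' or a standalone 'rus' do not.
print(tokenizer.tokenize('walrus'))

vocab = tokenizer.get_vocab()
print('wal' in vocab, '##rus' in vocab)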

Upvotes: 3
