Reputation: 14415
Referring to the documentation of the awesome Transformers library from Huggingface, I came across the add_tokens function.
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
num_added_toks = tokenizer.add_tokens(['new_tok1', 'my_new-tok2'])
model.resize_token_embeddings(len(tokenizer))
I tried the above by adding previously absent words to the default vocabulary. However, keeping everything else constant, I noticed a drop in the accuracy of the fine-tuned classifier that uses this updated tokenizer. I was able to reproduce similar behavior even when only 10% of the previously absent words were added.
My questions: is the add_tokens function expecting masked tokens, for example '##ah', '##red', '##ik', '##si', etc.? If so, is there a procedure to generate such masked tokens?
Any help would be appreciated.
Thanks in advance.
Upvotes: 3
Views: 3096
Reputation: 11220
If you add tokens to the tokenizer, you indeed make the tokenizer tokenize the text differently, but this is not the tokenization BERT was trained with, so you are basically adding noise to the input. The word embeddings are not trained and the rest of the network never saw them in context. You would need a lot of data to teach BERT to deal with the newly added words.
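As a quick illustration of why this adds noise (a minimal sketch assuming the standard transformers API; new_tok1 is just the token from the question), the embedding row created for a newly added token is randomly initialized and was never trained:

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
tokenizer.add_tokens(['new_tok1'])
model.resize_token_embeddings(len(tokenizer))

new_id = tokenizer.convert_tokens_to_ids('new_tok1')
# Freshly initialized values -- the rest of the network never saw this row in context.
print(model.get_input_embeddings().weight[new_id][:5])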
There are also ways to compute a single embedding for a new word such that it does not hurt BERT, as in this paper, but it seems pretty complicated and should not make much difference.
BERT uses a word-piece-based vocabulary, so it should not really matter if the words are present in the vocabulary as a single token or get split into multiple wordpieces. The model probably saw the split word during pre-training and will know what to do with it.
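For example (a small sketch; the example words are arbitrary and the exact split depends on the vocabulary), a common word stays a single token while a rarer word is broken into wordpieces the model did see during pre-training:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.tokenize('house'))         # a common word stays a single token
print(tokenizer.tokenize('unfathomable'))  # a rarer word is split into known wordpieces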
Regarding the ##-prefixed tokens: those are tokens that can only appear as the continuation of another wordpiece, never at the start of a word. E.g., walrus gets split into ['wal', '##rus'], and you need both of these wordpieces to be in the vocabulary, but not ##wal or rus.
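A small check along the lines of this example (the printed values assume the vocabulary behaves as described above):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.tokenize('walrus'))        # ['wal', '##rus'] per the example above
vocab = tokenizer.get_vocab()
print('wal' in vocab, '##rus' in vocab)    # both pieces need to be in the vocabulary
print('##wal' in vocab, 'rus' in vocab)    # neither is needed to tokenize 'walrus'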
Upvotes: 3