Reputation: 273
My question here is not how to add new tokens or how to train on a domain-specific corpus; I'm already doing that.
The thing is: am I supposed to add the domain-specific tokens before the MLM training, or should I just let BERT figure them out from context? If I choose not to include the tokens, am I going to get a poor task-specific model, e.g. for NER?
To give you more background on my situation: I'm training a BERT model on medical text in Portuguese, so disease names, drug names, and other domain terms are present in my corpus, but I'm not sure whether I have to add those tokens before training.
I saw this one: Using Pretrained BERT model to add additional words that are not recognized by the model
But my doubts remain, as other sources say otherwise.
Thanks in advance.
Upvotes: 3
Views: 1517
Reputation: 2348
Yes, you have to add them to the model's vocabulary.
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained(model_name)
tokenizer.add_tokens(['new', 'rdemorais', 'blabla'])

model = BertModel.from_pretrained(model_name, return_dict=False)
model.resize_token_embeddings(len(tokenizer))
The last line is important: since you changed the number of tokens in the tokenizer's vocabulary, you also need to resize the model's embedding matrix correspondingly.
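For context, here is a minimal sketch of how the resized tokenizer/model pair might feed into your MLM setup. The checkpoint name, output path, and example drug names are placeholders, not part of the original answer; swap in your own.

from transformers import BertTokenizerFast, BertForMaskedLM

model_name = 'neuralmind/bert-base-portuguese-cased'  # placeholder Portuguese checkpoint

tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

# Hypothetical domain terms; replace with your disease/drug names
new_tokens = ['dipirona', 'losartana']
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

# Before adding, these words are split into several word pieces;
# after adding, each one is kept as a single token
print(tokenizer.tokenize('paciente em uso de dipirona'))

# Save both so the MLM training script loads a consistent pair
tokenizer.save_pretrained('./bert-medical-pt')
model.save_pretrained('./bert-medical-pt')

Saving the tokenizer and model to the same directory keeps the vocabulary and the embedding matrix in sync when you later load them for MLM training or fine-tuning on NER.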
Upvotes: 3