Reputation: 306
I want some help regarding adding additional words to the existing BERT model. I have two queries; kindly guide me:
I am working on an NER task for a specific domain:
There are a few words (I am not sure of the exact number) that BERT tokenizes as [UNK], but the model needs to recognize those entities. The pretrained "bert-base-cased" model learns well (up to 80% accuracy) when I provide labeled data and fine-tune it, but intuitively the model should learn better if it recognized all the entities.
Do I need to add those unknown entities to vocabs.txt and train the model again?
Do I need to train the BERT model on my data from scratch?
Thanks...
Upvotes: 1
Views: 1440
Reputation: 11213
BERT works well because it is pre-trained on a very large textual dataset of 3.3 billion words. Training BERT from scratch is resource-demanding and does not pay off in most reasonable use cases.
BERT uses the WordPiece algorithm for input segmentation. In theory, this ensures that there are no out-of-vocabulary tokens that end up as [UNK]. The worst-case scenario in the segmentation is that input tokens get split into individual characters. If the segmentation is done correctly, [UNK] should appear only if the tokenizer encounters UTF-8 characters that were not in the training data.
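For illustration, here is a minimal sketch of this behaviour, assuming the Hugging Face transformers package and the same "bert-base-cased" checkpoint mentioned in the question; the exact word pieces you get depend on the vocabulary:

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

    # A rare domain word is split into word pieces rather than mapped to
    # [UNK]; the exact pieces depend on the vocabulary.
    print(tokenizer.tokenize("immunohistochemistry"))

    # A character that never occurred in the pre-training data (e.g. an
    # emoji) is likely the only kind of input that comes out as ['[UNK]'].
    print(tokenizer.tokenize("🙂"))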
The most probable sources of your problem are:
There is a bug in the tokenization, so it produces tokens that are not in the word-piece vocabulary. (Perhaps word tokenization instead of WordPiece tokenization?)
There is an encoding issue that generates invalid or weird UTF-8 characters (see the diagnostic sketch below).
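A quick way to tell the two apart is to run the WordPiece tokenizer directly over your raw text and print every whitespace-separated word that yields [UNK]. A rough diagnostic sketch, again assuming the Hugging Face transformers tokenizer; sentences is just a placeholder for lines from your NER corpus:

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

    # Placeholder: replace with sentences from your own NER data.
    sentences = ["replace this with text from your corpus"]

    for sentence in sentences:
        for word in sentence.split():
            pieces = tokenizer.tokenize(word)
            if tokenizer.unk_token in pieces:
                # repr() makes hidden or invalid characters visible, which
                # usually reveals whether this is an encoding problem.
                print(repr(word), "->", pieces)

If this prints nothing but your training pipeline still feeds [UNK] ids to the model, the problem is most likely in your own pre-tokenization step rather than in the WordPiece vocabulary.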
Upvotes: 1