muzamil

Reputation: 306

Using Pretrained BERT model to add additional words that are not recognized by the model

I want some help regarding adding additional words to an existing BERT model. I have two queries; kindly guide me:

I am working on an NER task for a specific domain:

There are a few words (I am not sure of the exact number) that BERT tokenizes as [UNK], but those entities are required for the model to recognize. The pretrained "bert-base-cased" model reaches about 80% accuracy after fine-tuning on my labeled data, but intuitively the model should learn better if it recognized all the entities.

  1. Do I need to add those unknown entities to vocab.txt and train the model again? (A rough idea of what I mean is sketched after this list.)

  2. Do I need to train the BERT model on my data from scratch?
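
For reference, this is roughly what I have in mind for option 1, assuming the Hugging Face `transformers` library is used for fine-tuning (the domain terms and `num_labels` below are just placeholders):

    from transformers import BertTokenizer, BertForTokenClassification

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    model = BertForTokenClassification.from_pretrained("bert-base-cased", num_labels=9)

    # Register a few domain terms as whole tokens (placeholder examples).
    new_terms = ["domainterm1", "domainterm2"]
    num_added = tokenizer.add_tokens(new_terms)

    # Grow the embedding matrix so the new ids get (randomly initialised) vectors,
    # then fine-tune as usual so those vectors are learned.
    if num_added > 0:
        model.resize_token_embeddings(len(tokenizer))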

Thanks...

Upvotes: 1

Views: 1440

Answers (1)

Jindřich

Reputation: 11213

BERT works well because it is pre-trained on a very large textual dataset of 3.3 billion words. Training BERT from scratch is resource-demanding and does not pay off in most reasonable use cases.

BERT uses the WordPiece algorithm for input segmentation. In theory, this should ensure that there is no out-of-vocabulary token that ends up as [UNK]. The worst-case scenario is that an input token gets segmented into individual characters. If the segmentation is done correctly, [UNK] should appear only if the tokenizer encounters UTF-8 characters that were not in the training data.
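
You can check this yourself; a rough sketch with the Hugging Face `transformers` tokenizer (the exact subword pieces depend on the vocabulary):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

    # An unseen word is split into known subword pieces, not mapped to [UNK].
    print(tokenizer.tokenize("dexamethasone"))
    # e.g. ['de', '##xa', ...] -- the pieces vary with the vocabulary

    # [UNK] normally shows up only for characters missing from the vocabulary.
    print(tokenizer.tokenize("☃"))
    # typically ['[UNK]']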

The most probable sources of your problem are the following (a small diagnostic sketch follows the list):

  1. There is a bug in your tokenization pipeline, so it produces tokens that are not in the WordPiece vocabulary. (Perhaps word tokenization instead of WordPiece tokenization?)

  2. There is an encoding issue that generates invalid or weird UTF-8 characters.
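
As a quick check, something like this (a sketch, assuming your raw sentences are plain strings) lists the words that actually produce [UNK] after WordPiece:

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    unk_id = tokenizer.unk_token_id

    def find_unk_words(sentences):
        """Print every whitespace-separated word whose WordPiece ids contain [UNK]."""
        for sentence in sentences:
            for word in sentence.split():
                ids = tokenizer.encode(word, add_special_tokens=False)
                if unk_id in ids:
                    print(f"[UNK] produced for: {word!r}")

    # Usage with a made-up example sentence:
    find_unk_words(["The sample contains a ☃ character"])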

Upvotes: 1
