sravani.s

Reputation: 185

How to further pretrain a BERT model using our custom data and increase the vocab size?

I am trying to further pretrain the bert-base model on my custom data. The steps I'm following are:

  1. Generate a list of words from the custom data and add these words to the existing bert-base vocab file. The vocab size has been increased from 30522 to 35880.

  2. I created the input data using create_pretraining_data.py from the official BERT GitHub repo.

  3. I ran the pretraining using run_pretraining.py but hit the following shape-mismatch error:

ValueError: Shape of variable bert/embeddings/word_embeddings:0 ((35880, 128)) doesn't match with shape of tensor bert/embeddings/word_embeddings ([30522, 128]) from checkpoint reader.
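For reference, the shapes stored in the downloaded checkpoint can be listed with something like the sketch below (the checkpoint path is just a placeholder for my local copy); it confirms the embedding table in the checkpoint still has the original 30522 rows:

    import tensorflow as tf

    # tf.train.list_variables works in both TF1 and TF2 and returns
    # (variable_name, shape) pairs for everything stored in the checkpoint.
    # "bert_model.ckpt" is the checkpoint prefix inside the downloaded model folder.
    for name, shape in tf.train.list_variables("/path/to/bert_model.ckpt"):
        if "word_embeddings" in name:
            print(name, shape)  # e.g. bert/embeddings/word_embeddings [30522, 128]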

Note: I updated the bert_config file with the new vocab_size of 35880.

Please help me understand the error and what changes need to be made so that I can pretrain with the custom vocab file.

Upvotes: 1

Views: 3450

Answers (1)

Sam Tseng

Reputation: 178

You can further pretrain a BERT model on your own data with run_mlm.py at: https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling.
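If you go the Hugging Face route, the vocab mismatch from your question goes away if you add the new words to the tokenizer and resize the embedding matrix before training. A minimal sketch (the word list and output directory name are placeholders):

    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

    # placeholder: the domain-specific words extracted from your custom data
    new_words = ["wordA", "wordB", "wordC"]
    tokenizer.add_tokens(new_words)

    # grow the word-embedding matrix so its first dimension matches the new
    # vocab size; the rows for the added tokens are freshly initialized
    model.resize_token_embeddings(len(tokenizer))

    # save both, then point run_mlm.py at this directory via --model_name_or_path
    tokenizer.save_pretrained("bert-custom-vocab")
    model.save_pretrained("bert-custom-vocab")

run_mlm.py can then continue MLM pretraining on your text without the shape mismatch, since the saved model already has the enlarged embedding table.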

Also look at this: https://github.com/allenai/dont-stop-pretraining and the paper: https://arxiv.org/pdf/2004.10964.pdf for related ideas and terminology: domain-adaptive pretraining and task-adaptive pretraining.

Upvotes: 2
