Reputation: 185
I am trying to further pretrain the bert-base model on custom data. The steps I'm following are:

1. Generate a list of words from the custom data and append them to the existing bert-base vocab file. This increases the vocab size from 30522 to 35880.
2. Create the input data using create_pretraining_data.py from the official BERT GitHub repository.
3. Run the pretraining with run_pretraining.py, which fails with a shape-mismatch error:

ValueError: Shape of variable bert/embeddings/word_embeddings:0 ((35880, 128)) doesn't match with shape of tensor bert/embeddings/word_embeddings ([30522, 128]) from checkpoint reader.
Note: I updated the bert_config file so that vocab_size is 35880.
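For reference, this is roughly how I extend the vocab file and update the config in step 1 (the file paths and word list below are placeholders for my actual data):

```python
import json

# Placeholder paths; in my setup these point to the downloaded bert-base files.
VOCAB_FILE = "uncased_L-12_H-768_A-12/vocab.txt"
CONFIG_FILE = "uncased_L-12_H-768_A-12/bert_config.json"

# Words extracted from the custom corpus that may be missing from the vocab.
new_words = ["exampleword1", "exampleword2"]  # placeholder list

# Keep only words not already in the vocab, then append them to vocab.txt.
with open(VOCAB_FILE, "r", encoding="utf-8") as f:
    existing = set(line.strip() for line in f)
to_add = [w for w in new_words if w not in existing]

with open(VOCAB_FILE, "a", encoding="utf-8") as f:
    for w in to_add:
        f.write(w + "\n")

# Update vocab_size in bert_config.json to match the new vocab length.
with open(CONFIG_FILE, "r", encoding="utf-8") as f:
    config = json.load(f)
config["vocab_size"] = len(existing) + len(to_add)
with open(CONFIG_FILE, "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)
```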
Please help me understand the error and what changes are needed so that I can pretrain with the custom vocab file.
Upvotes: 1
Views: 3450
Reputation: 178
You can further pretrain a BERT model on your own data with run_mlm.py at https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling.
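Since your error comes from the enlarged vocab, note that in the Hugging Face workflow you would typically add the new tokens to the tokenizer and resize the model's embedding matrix before running run_mlm.py. A minimal sketch, assuming bert-base-uncased and a placeholder word list:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed starting point: the standard bert-base-uncased checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Placeholder list of domain-specific words from your corpus.
new_words = ["exampleword1", "exampleword2"]

# add_tokens skips words the tokenizer already knows and returns the count added.
num_added = tokenizer.add_tokens(new_words)

# Grow the word-embedding matrix so its first dimension matches the new vocab size;
# the new rows are randomly initialized and get trained during further pretraining.
model.resize_token_embeddings(len(tokenizer))

# Save both; point run_mlm.py at this directory (e.g. via --model_name_or_path
# and --tokenizer_name) so the resized model and extended vocab are used.
tokenizer.save_pretrained("bert-base-uncased-extended")
model.save_pretrained("bert-base-uncased-extended")
```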
Also look at https://github.com/allenai/dont-stop-pretraining and the paper https://arxiv.org/pdf/2004.10964.pdf for related ideas and terminology: domain-adaptive pretraining and task-adaptive pretraining.
Upvotes: 2