sravani.s

Reputation: 185

How to further pretrain a BERT model using our custom data and increase the vocab size?

I am trying to further pretrain the bert-base model on my custom data. The steps I'm following are:

  1. Generate a list of words from the custom data and add these words to the existing bert-base vocab file. The vocab size has been increased from 30522 to 35880.

  2. I created the input data using create_pretraining_data.py from the official BERT GitHub repo.

  3. I ran the pretraining using run_pretraining.py but hit the following shape-mismatch error:

ValueError: Shape of variable bert/embeddings/word_embeddings:0 ((35880, 128)) doesn't match with shape of tensor bert/embeddings/word_embeddings ([30522, 128]) from checkpoint reader.
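For reference, the shapes stored in the downloaded checkpoint can be listed with something like the sketch below (the checkpoint path is just a placeholder for my local copy); it confirms the embedding table in the checkpoint still has the original 30522 rows:

    import tensorflow as tf

    # tf.train.list_variables works in both TF1 and TF2 and returns
    # (variable_name, shape) pairs for everything stored in the checkpoint.
    # "bert_model.ckpt" is the checkpoint prefix inside the downloaded model folder.
    for name, shape in tf.train.list_variables("/path/to/bert_model.ckpt"):
        if "word_embeddings" in name:
            print(name, shape)  # e.g. bert/embeddings/word_embeddings [30522, 128]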

Note: I updated the bert_config file with the new vocab_size of 35880.

Please help me understand the error and what changes need to be made so that I can pretrain with the custom vocab file.

Upvotes: 1

Views: 3450

Answers (1)

Sam Tseng

Reputation: 178

You can further pretrain a BERT model on your own data with run_mlm.py at: https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling.
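If you go the Hugging Face route, the vocab mismatch from your question goes away if you add the new words to the tokenizer and resize the embedding matrix before training. A minimal sketch (the word list and output directory name are placeholders):

    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

    # placeholder: the domain-specific words extracted from your custom data
    new_words = ["wordA", "wordB", "wordC"]
    tokenizer.add_tokens(new_words)

    # grow the word-embedding matrix so its first dimension matches the new
    # vocab size; the rows for the added tokens are freshly initialized
    model.resize_token_embeddings(len(tokenizer))

    # save both, then point run_mlm.py at this directory via --model_name_or_path
    tokenizer.save_pretrained("bert-custom-vocab")
    model.save_pretrained("bert-custom-vocab")

run_mlm.py can then continue MLM pretraining on your text without the shape mismatch, since the saved model already has the enlarged embedding table.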

Also look at this: https://github.com/allenai/dont-stop-pretraining and the paper: https://arxiv.org/pdf/2004.10964.pdf for related ideas and terminology: domain-adaptive pretraining and task-adaptive pretraining.

Upvotes: 2
