Can you train a BERT model from scratch with task specific architecture?

Question

BERT pre-training of the base-model is done by a language modeling approach, where we mask certain percent of tokens in a sentence, and we make the model learn those missing mask. Then, I think in order to do downstream tasks, we add a newly initialized layer and we fine-tune the model.

However, suppose we have a gigantic dataset for sentence classification. Theoretically, can we initialize the BERT base architecture from scratch, train both the additional downstream task specific layer + the base model weights form scratch with this sentence classification dataset only, and still achieve a good result?

Wasi Ahmad · Accepted Answer

BERT can be viewed as a language encoder, which is trained on a humongous amount of data to learn the language well. As we know, the original BERT model was trained on the entire English Wikipedia and Book corpus, which sums to 3,300M words. BERT-base has 109M model parameters. So, if you think you have large enough data to train BERT, then the answer to your question is yes.

However, when you said "still achieve a good result", I assume you are comparing against the original BERT model. In that case, the answer lies in the size of the training data.

I am wondering why do you prefer to train BERT from scratch instead of fine-tuning it? Is it because you are afraid of the domain adaptation issue? If not, pre-trained BERT is perhaps a better starting point.

Please note, if you want to train BERT from scratch, you may consider a smaller architecture. You may find the following papers useful.

Can you train a BERT model from scratch with task specific architecture?

Answers (2)

Related Questions