Reputation: 1152
I'm looking at the BERT model (you can find the description here) in detail, and I'm having trouble understanding why the masked language model keeps the original word or replaces it with a random word 20% of the time, instead of always using the [MASK] token.
We train with the bidirectional technique, and the article explains that "[MASK] token is never seen during fine-tuning", but to me these are two different steps: we first do the bidirectional pre-training and then the downstream task.
Can someone explain where my understanding goes wrong?
Upvotes: 0
Views: 1077
Reputation: 1296
If you don't use random replacement during training, your network won't learn to extract useful features from non-masked tokens.
In other words, if you only mask tokens and try to predict them, it is a waste of resources for the network to extract good features for the non-masked tokens (remember that your network is only as good as your task, and it will try to find the easiest way to solve that task).
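As a rough sketch (my own toy code, not the official BERT implementation), the 80/10/10 masking rule from the paper can be illustrated like this, assuming a list of tokens and a vocabulary list:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Toy illustration of BERT-style masking.

    For each selected position (~15% of tokens):
      - 80% of the time: replace with [MASK]
      - 10% of the time: replace with a random vocabulary token
      - 10% of the time: keep the original token
    The model must predict the original token at every selected position,
    so it cannot rely on seeing [MASK] alone and has to build useful
    representations for ordinary (non-masked) tokens as well.
    """
    labels = [-100] * len(tokens)          # -100 = position ignored in the loss
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                # the model must recover this token
            r = random.random()
            if r < 0.8:
                tokens[i] = "[MASK]"
            elif r < 0.9:
                tokens[i] = random.choice(vocab)
            # else: keep the token unchanged
    return tokens, labels
```

Because the input sometimes contains the original or a random word at a predicted position, the model cannot simply treat [MASK] as the only signal, which also reduces the mismatch with fine-tuning, where [MASK] never appears.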
Upvotes: 1