Reputation: 1152
I'm looking at the BERT model (you can find the description here) in detail, and I'm having trouble understanding why the masked language model keeps the original word or replaces it with a random word 20% of the time, instead of always using the [MASK] token.
We train with the bidirectional technique, and the article explains that "[MASK] token is never seen during fine-tuning", but to me these are two different steps: we first do the bidirectional pre-training and then the downstream task.
Can someone explain where my understanding goes wrong?
Upvotes: 0
Views: 1077
Reputation: 1296
If you don't use random replacement during training, your network won't learn to extract useful features from non-masked tokens.
In other words, if you only mask tokens and try to predict them, it is a waste of resources for the network to extract good features for the non-masked tokens (remember that your network is only as good as your task, and it will try to find the easiest way to solve that task).
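As a rough sketch (my own toy code, not the official BERT implementation), the 80/10/10 masking rule from the paper can be illustrated like this, assuming a list of tokens and a vocabulary list:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Toy illustration of BERT-style masking.

    For each selected position (~15% of tokens):
      - 80% of the time: replace with [MASK]
      - 10% of the time: replace with a random vocabulary token
      - 10% of the time: keep the original token
    The model must predict the original token at every selected position,
    so it cannot rely on seeing [MASK] alone and has to build useful
    representations for ordinary (non-masked) tokens as well.
    """
    labels = [-100] * len(tokens)          # -100 = position ignored in the loss
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                # the model must recover this token
            r = random.random()
            if r < 0.8:
                tokens[i] = "[MASK]"
            elif r < 0.9:
                tokens[i] = random.choice(vocab)
            # else: keep the token unchanged
    return tokens, labels
```

Because the input sometimes contains the original or a random word at a predicted position, the model cannot simply treat [MASK] as the only signal, which also reduces the mismatch with fine-tuning, where [MASK] never appears.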
Upvotes: 1