Reputation: 31
With Transformer models, especially BERT, does it make sense (and would it be statistically sound) to programmatically forbid the model from outputting the special tokens as predictions? How is this handled in the original implementations? During training the model has to learn not to predict these tokens anyway, but would such an intervention help (or hurt)?
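Concretely, by "forbid" I mean something like the sketch below (using the Hugging Face transformers API; the model name and the masking-at-inference approach are just my illustration of the idea, not something from the original BERT code):

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, vocab_size)

# Set the logits of all special tokens ([CLS], [SEP], [MASK], [PAD], [UNK])
# to -inf so that argmax can never select them as a prediction.
special_ids = tokenizer.all_special_ids
logits[:, :, special_ids] = float("-inf")

# Read off the prediction at the masked position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
predicted_id = logits[mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))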
Upvotes: 3
Views: 1167
Reputation: 508
If I understand your question correctly, you're asking how BERT (or other transformer-based models) handles special characters. This has less to do with the model architecture (this answer applies equally to auto-regressive models, or even non-neural models) than with the pre-processing steps.
In particular, BERT uses a subword tokenizer (WordPiece, a close relative of Byte-Pair Encoding) to split the text into subword tokens. If the tokenizer does not recognize a sequence of characters, it replaces that sequence with the [UNK] meta-token, very much like the [MASK] or [CLS] tokens. Google will turn up many answers if you want more specifics; for example, from a blog article:
There is an important point to note when we use a pre-trained model. Since the model is pre-trained on a certain corpus, the vocabulary was also fixed. In other words, when we apply a pre-trained model to some other data, it is possible that some tokens in the new data might not appear in the fixed vocabulary of the pre-trained model. This is commonly known as the out-of-vocabulary (OOV) problem.
For tokens not appearing in the original vocabulary, it is designed that they should be replaced with a special token [UNK], which stands for unknown token.
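For what it's worth, you can see this behaviour directly with the Hugging Face tokenizer (a small illustration, assuming bert-base-uncased; the input string is just an arbitrary example containing a character that is likely outside BERT's vocabulary):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A character the vocabulary does not cover gets mapped to [UNK].
print(tokenizer.tokenize("hello \u2603"))            # e.g. ['hello', '[UNK]']
print(tokenizer.unk_token, tokenizer.unk_token_id)   # '[UNK]' and its vocabulary id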
Upvotes: 1