Reputation: 31
With Transformer models, especially BERT, does it make sense (and would it be statistically sound) to programmatically forbid the model from outputting the special tokens as predictions? How is this handled in the original implementations? During training the model has to learn not to predict these tokens anyway, but would such an intervention help (or hurt)?
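Concretely, by "forbid" I mean something like the sketch below (using the Hugging Face transformers API; the model name and the masking-at-inference approach are just my illustration of the idea, not something from the original BERT code):

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, vocab_size)

# Set the logits of all special tokens ([CLS], [SEP], [MASK], [PAD], [UNK])
# to -inf so that argmax can never select them as a prediction.
special_ids = tokenizer.all_special_ids
logits[:, :, special_ids] = float("-inf")

# Read off the prediction at the masked position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
predicted_id = logits[mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))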
Upvotes: 3
Views: 1167
Reputation: 508
If I understand your question correctly, you're asking how BERT (or other transformer-based models) handles special characters. This has less to do with the model architecture (this answer applies equally to auto-regressive models, or even non-neural models) than with the pre-processing steps.
In particular, BERT uses a subword tokenizer (WordPiece, a close relative of Byte-Pair Encoding) to split the text into subword tokens. If the tokenizer does not recognize a sequence of characters, it replaces that sequence with the [UNK] meta-token, very much like the [MASK] or [CLS] tokens. Google will turn up many answers if you want more specifics; for example, from a blog article:
There is an important point to note when we use a pre-trained model. Since the model is pre-trained on a certain corpus, the vocabulary was also fixed. In other words, when we apply a pre-trained model to some other data, it is possible that some tokens in the new data might not appear in the fixed vocabulary of the pre-trained model. This is commonly known as the out-of-vocabulary (OOV) problem.
For tokens not appearing in the original vocabulary, it is designed that they should be replaced with a special token [UNK], which stands for unknown token.
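For what it's worth, you can see this behaviour directly with the Hugging Face tokenizer (a small illustration, assuming bert-base-uncased; the input string is just an arbitrary example containing a character that is likely outside BERT's vocabulary):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A character the vocabulary does not cover gets mapped to [UNK].
print(tokenizer.tokenize("hello \u2603"))            # e.g. ['hello', '[UNK]']
print(tokenizer.unk_token, tokenizer.unk_token_id)   # '[UNK]' and its vocabulary id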
Upvotes: 1