Reputation: 13
When doing machine translation, if you segment words into subwords, e.g. with BPE, how big is the resulting vocabulary?
Upvotes: 0
Views: 129
Reputation: 1279
You choose the vocabulary size.
You can make the cut-off either absolute, e.g. 100K vocabulary items in total, or frequency-based, e.g. only include vocabulary items that occur 10 or more times.
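As a minimal sketch of both cut-off styles applied to a plain word vocabulary (the file name `train.txt`, the 100K cap, and the threshold of 10 are assumptions for the example):

```python
from collections import Counter

# Hypothetical corpus: one whitespace-tokenised sentence per line.
with open("train.txt", encoding="utf-8") as f:
    counts = Counter(tok for line in f for tok in line.split())

# Absolute cut-off: keep the 100K most frequent items.
vocab_topk = {tok for tok, _ in counts.most_common(100_000)}

# Frequency cut-off: keep items that occur 10 or more times.
vocab_minfreq = {tok for tok, c in counts.items() if c >= 10}
```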
machinetranslate.org/vocabulary:
The vocabulary is affected by the choice of tokenisation algorithm or by the use of subword models such as byte-pair encoding. In the case of byte-pair encoding, this can cause vocabulary size to become a hyperparameter that affects the generalisation of the model. By 2022, vocabulary sizes with subword models typically ranged from 16,000 to 64,000.
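For a concrete illustration of the vocabulary size as a hyperparameter, a BPE model in that 16K–64K range can be trained with a subword library such as SentencePiece. This is only a sketch, not the quoted source's recipe; the file names and the 32K setting are assumptions:

```python
import sentencepiece as spm

# Train a BPE model with a fixed vocabulary size (here 32K, within the typical range).
spm.SentencePieceTrainer.train(
    input="train.txt",        # hypothetical plain-text training corpus
    model_prefix="bpe32k",
    model_type="bpe",
    vocab_size=32000,
)

sp = spm.SentencePieceProcessor(model_file="bpe32k.model")
print(sp.encode("An example sentence.", out_type=str))  # list of subword pieces
```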
Upvotes: 0
Reputation: 11220
The BPE algorithm starts with the list of characters in the data and iteratively merges the most frequent symbol pairs. If the algorithm had no stopping criterion, you would end up with a vocabulary that covers all words in your training data, plus all characters, plus all intermediate merges between the characters and the full words.
The reason for using BPE is that we simply cannot afford a vocabulary that contains all words from the training data: that can easily be millions of word forms. When using BPE, you therefore need to specify in advance how many merges you want. Typically, the number of merges is 20–50k. This ensures that the most frequent words remain intact, whereas less frequent words get split into smaller units. The resulting vocabulary size is then the number of merges plus the size of the original character alphabet.
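To make the relationship between merge count and vocabulary size concrete, here is a toy sketch of the BPE learning loop (the tiny corpus and the merge count are assumptions for the example; real toolkits such as subword-nmt implement the same idea more efficiently):

```python
from collections import Counter

def learn_bpe(corpus_words, num_merges):
    """Toy BPE learner: repeatedly merge the most frequent adjacent symbol pair."""
    # Represent each word as a tuple of characters, weighted by its corpus frequency.
    word_freqs = Counter(corpus_words)
    words = {tuple(w): f for w, f in word_freqs.items()}

    alphabet = {ch for w in words for ch in w}
    merges = []

    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the corpus.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)

        # Apply the chosen merge to every word.
        new_words = {}
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] = new_words.get(tuple(merged), 0) + freq
        words = new_words

    # Final vocabulary: original characters plus one new symbol per merge.
    vocab = alphabet | {a + b for a, b in merges}
    return merges, vocab

corpus = "low low low lower lower newest newest newest widest widest".split()
merges, vocab = learn_bpe(corpus, num_merges=10)
print(len(vocab))  # roughly the alphabet size + the number of merges
```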
Upvotes: 1