jiangxoo

Reputation: 13

Vocabulary size in machine translation

When doing machine translation with subword segmentation, e.g. BPE, how large is the resulting vocabulary?

Upvotes: 0

Views: 129

Answers (2)

Adam Bittlingmayer

Reputation: 1279

You choose the vocabulary size.

You can make the cut-off either absolute, e.g. 100K vocabulary items in total, or based on occurrences, e.g. only include vocabulary items that occur 10 or more times.

machinetranslate.org/vocabulary:

The vocabulary is affected by choice of tokenisation algorithm or the use of subword models such as byte-pair encoding. In the case of byte-pair encoding, this can cause vocabulary size to become a hyperparameter that affects the generalisation of the model. By 2022, vocabulary sizes with subword models typically ranged from 16000 to 64000.
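As a minimal sketch, both cut-off styles map directly onto parameters of the Hugging Face tokenizers library's BpeTrainer: vocab_size sets the absolute cap, min_frequency the occurrence threshold. The training file corpus.txt is a hypothetical path, and 32000 / 10 are example values, not recommendations:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Placeholder token for symbols never seen during training.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Both cut-off styles from the answer:
#   vocab_size    -> absolute cap on the number of vocabulary items
#   min_frequency -> only keep items occurring at least this often
trainer = BpeTrainer(vocab_size=32000, min_frequency=10)

# "corpus.txt" is a hypothetical plain-text training file.
tokenizer.train(["corpus.txt"], trainer)
print(tokenizer.get_vocab_size())
```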

Upvotes: 0

Jindřich

Reputation: 11220

The BPE algorithm starts with the list of characters in the data and iteratively merges the most frequent symbol pairs. If the algorithm had no stopping criterion, you would end up with a vocabulary covering all words in your training data, plus all characters, plus all intermediate merges between the characters and the words.

The reason for using BPE is that we simply cannot afford a vocabulary containing every word from the training data: it can easily run to millions of word forms. When using BPE, you therefore specify in advance how many merges you want, as in the sketch below. Typically, the number of merges is 20k–50k. This ensures that the most frequent words remain intact, whereas less frequent words get split into smaller units. The resulting vocabulary size is then the number of merges plus the size of the original alphabet.
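A minimal sketch of that merge loop on a toy word-frequency table (the corpus and the num_merges value are made up for illustration; real tokenizers use optimized implementations of the same idea):

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), w): f for w, f in vocab.items()}

# Toy corpus: words as space-separated characters with an end-of-word marker.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

num_merges = 10  # the stopping criterion: final vocab ≈ alphabet + num_merges
for i in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent pair wins
    vocab = merge_pair(best, vocab)
    print(f'merge {i + 1}: {best}')
```

Frequent words like "newest" end up as single units after a few merges, while rare words stay split into smaller pieces.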

Upvotes: 1
