Shubham Tyagi

Reputation: 191

Concept of Bucketing in Seq2Seq model

To handle sequences of different lengths, we use bucketing and padding. In bucketing, we create several buckets, each with its own maximum length, to reduce the amount of padding; after defining the buckets, we train a separate model on each bucket.

This is what I have found so far. But what I don't understand is how all these different models are trained, and how they are used for translating a new sentence.

Upvotes: 4

Views: 3299

Answers (1)

Maxim

Reputation: 53768

Both at training and inference time, the algorithm needs to pick the network that is best suited for the current input sentence (or batch). Usually, it simply takes the smallest bucket whose input size is greater than or equal to the sentence length.

[figure: bucketing]

For example, suppose there are just two buckets, [10, 16] and [20, 32]: the first one takes any input of up to length 10 (padded to exactly 10) and outputs a translated sentence of up to length 16 (padded to 16). Likewise, the second bucket handles inputs of up to length 20. The two networks corresponding to these buckets are trained on non-intersecting sets of inputs.

So, for a sentence of length 8, it's better to select the first bucket. Note that if this is a test sentence, the second bucket could handle it as well, but its network was trained on longer sentences, from 11 to 20 words, so it is unlikely to translate a short sentence well. The network corresponding to the first bucket was trained on inputs of length 1 to 10, and hence is the better choice.
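Here's a minimal sketch of that selection-and-padding logic in plain Python. The helper names (`select_bucket`, `pad_to`), the `PAD_ID` value, and the token ids are made up for illustration; the actual seq2seq tutorial code is organized differently, but the idea is the same:

```python
PAD_ID = 0  # assumed id of the padding token

# (max input length, max output length) per bucket
buckets = [(10, 16), (20, 32)]

def select_bucket(input_len, output_len=0):
    """Pick the smallest bucket that fits both the input and output lengths."""
    for bucket_id, (in_size, out_size) in enumerate(buckets):
        if input_len <= in_size and output_len <= out_size:
            return bucket_id
    return None  # the sentence is longer than any bucket

def pad_to(token_ids, size):
    """Pad a list of token ids with PAD_ID up to the bucket size."""
    return token_ids + [PAD_ID] * (size - len(token_ids))

# Example: a source sentence of 8 tokens goes to bucket 0 and is padded to 10.
source = [4, 17, 8, 23, 5, 9, 41, 2]       # 8 token ids (hypothetical)
bucket_id = select_bucket(len(source))      # -> 0
encoder_input = pad_to(source, buckets[bucket_id][0])  # length 10
```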

You may be in trouble if a test sentence has length 25, longer than any available bucket. There's no universal solution here; the most practical course of action is to trim the input to 20 words and try to translate it anyway.
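Continuing the sketch above, the trimming fallback for over-long inputs could look like this (again, a hypothetical helper built on the assumed `buckets`, `select_bucket`, and `pad_to` from the previous snippet):

```python
def prepare_input(token_ids):
    """Truncate inputs longer than the largest bucket, then select and pad."""
    max_in = buckets[-1][0]                  # largest input size, here 20
    token_ids = token_ids[:max_in]           # trim over-long sentences
    bucket_id = select_bucket(len(token_ids))
    return bucket_id, pad_to(token_ids, buckets[bucket_id][0])
```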

Upvotes: 11
