mobileideafactory

Reputation: 1670

Why does padding of 'max_length' in Huggingface cause much slower model inference?

I have trained a bert-base-uncased AutoModelForSequenceClassification model and found that model inference is at least 2x faster if I comment out padding='max_length' in the encode step.
My understanding is that BERT expects a fixed input length of 512 tokens; doesn't that imply the input must be padded to 512?

sequence = tokenizer.encode_plus(question,
                                 passage,
                                 max_length=256,
                                 padding='max_length',
                                 truncation="longest_first",
                                 return_tensors="pt")['input_ids'].to(device)

Upvotes: 2

Views: 2819

Answers (1)

Berkay Berabi

Reputation: 2348

My understanding is that BERT expects a fixed length of 512 tokens… doesn't that imply input must be padded to 512?

No, this is not true. BERT has a maximum input length of 512 tokens, but this does not imply that every input must be of length 512. It only means that it cannot handle longer inputs: any input longer than 512 tokens will be truncated to length 512.
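A quick way to see this (assuming a bert-base-uncased tokenizer; the texts are made up) is that truncation only kicks in for long inputs, while short inputs keep their natural length:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

long_text = "word " * 1000     # far more than 512 tokens
print(len(tokenizer(long_text, truncation=True, max_length=512)["input_ids"]))   # 512

short_text = "a short sentence"
print(len(tokenizer(short_text)["input_ids"]))   # e.g. 5 -- not padded to 512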

Requirement for batches

The BERT model, and pretty much all neural networks that deal with sequences, expects the elements in a batch to be of the same size. The reason is that they operate on tensors/matrices, which by definition cannot contain rows of variable length; all rows must have the same length. This can be seen as a simple mathematical requirement or as an implementation detail.
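As a minimal sketch of that requirement (the token ids are made up; 0 stands in for BERT's [PAD] id), a ragged batch cannot even be turned into a tensor, while a padded one can:

import torch

# Two token-id sequences of different lengths cannot be stacked into one tensor.
ragged = [[101, 2054, 2003, 102], [101, 2023, 102]]
try:
    torch.tensor(ragged)
except ValueError as err:
    print("ragged batch rejected:", err)

# After padding the shorter row, both rows have length 4 and form a regular 2 x 4 tensor.
padded = [[101, 2054, 2003, 102], [101, 2023, 102, 0]]
print(torch.tensor(padded).shape)   # torch.Size([2, 4])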

Static padding

This is the option where all elements in the batch are padded to the maximum length of the model. It is simple and works, but it has the disadvantage of introducing a lot of unnecessary computation. Remember the tensors and matrices above? Essentially, you are increasing the size of the tensors the model operates on, which increases the computational cost.

In HuggingFace, this corresponds to padding="max_length"
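As a small illustration (assuming a bert-base-uncased tokenizer; the texts are made up), every element comes out at the full max_length regardless of how short it actually is:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

enc = tokenizer(
    ["a short question", "another short passage"],
    padding="max_length",       # pad every element up to max_length
    max_length=512,
    truncation=True,
    return_tensors="pt",
)
print(enc["input_ids"].shape)   # torch.Size([2, 512]) even though both texts are tiny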

Dynamic Padding

To overcome the issues with static padding, dynamic padding was introduced. The idea is simple: there is no requirement that all batches have the same size, so each batch can have a different maximum length. The neural network will not complain as long as all elements within a batch have the same size. So, instead of padding each element in a batch to the model's maximum length, you can pad them to the longest length in that batch. For instance, if a batch contains 4 elements with lengths 10, 30, 78, and 89, all of them are padded to length 89 under dynamic padding, whereas they would be padded to 512 under static padding.

In HuggingFace, this corresponds to padding="longest"
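Again a small illustration with made-up texts: the batch is only padded to its own longest element, not to 512:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

enc = tokenizer(
    ["a short question", "a somewhat longer passage with a few more tokens"],
    padding="longest",          # pad only up to the longest element in this batch
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
print(enc["input_ids"].shape)   # e.g. torch.Size([2, 11]) -- far smaller than [2, 512]

During training, the same effect is typically obtained with DataCollatorWithPadding, which pads each batch on the fly (see the linked course chapter).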

Why did it work for me without specifying any padding length?

I think your input contains a single element, i.e., it is a batch with a single element. Therefore, it does not matter how you pad it, or whether you pad it at all.
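For the speed difference observed in the question, a rough sketch along these lines can be used to compare a padded and an unpadded single input (assuming a plain bert-base-uncased checkpoint as a stand-in for your fine-tuned model; the timing is indicative only, with no warm-up):

import time
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased").to(device).eval()

question, passage = "Why is inference slow?", "Padding adds extra tokens."

def timed_inference(**pad_kwargs):
    ids = tokenizer.encode_plus(question, passage, max_length=256,
                                truncation="longest_first",
                                return_tensors="pt", **pad_kwargs)["input_ids"].to(device)
    start = time.perf_counter()
    with torch.no_grad():
        model(ids)
    if device == "cuda":
        torch.cuda.synchronize()
    return ids.shape[1], time.perf_counter() - start

print(timed_inference(padding="max_length"))   # sequence length 256, noticeably slower
print(timed_inference())                       # natural length (a dozen or so tokens), faster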

You can find more on this page: https://huggingface.co/course/en/chapter3/2?fw=pt#dynamic-padding

Upvotes: 4
