Reputation: 768
I was recently reading the BERT source code from the Hugging Face project, and I noticed that the so-called "learnable position encoding", as far as the implementation goes, seems to come down to a specific nn.Parameter.
import torch
import torch.nn as nn

class LearnablePositionalEncoding(nn.Module):
    def __init__(self, seq_len, dim):
        super().__init__()
        # one learnable vector per position
        self.positional_encoding = nn.Parameter(torch.zeros(seq_len, dim))

    def forward(self, x):
        return x + self.positional_encoding
↑ Something like this, I assume, is what implements the learnable position encoding. I'm not sure whether it is really that simple or whether I'm misunderstanding it, so I'd like to ask someone with experience.
In addition, I noticed that in the classic BERT structure the position is actually only encoded once, at the initial input. Does this mean that the subsequent BERT layers lose the ability to capture positional information?
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(...)
      ...
    )
  )
  (pooler): BertPooler(...)
)
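As far as I can tell, the embedding block above boils down to something like the following simplified sketch (ignoring dropout and masking; the token ids are arbitrary examples I made up), which is why I say the position is only encoded once:

import torch
import torch.nn as nn

# Simplified sketch of the BertEmbeddings step shown above: the position
# information is added exactly once, before the encoder stack.
vocab_size, max_len, hidden = 30522, 512, 768
word_embeddings = nn.Embedding(vocab_size, hidden, padding_idx=0)
position_embeddings = nn.Embedding(max_len, hidden)
token_type_embeddings = nn.Embedding(2, hidden)
layer_norm = nn.LayerNorm(hidden, eps=1e-12)

input_ids = torch.randint(0, vocab_size, (1, 4))              # (batch, seq_len), arbitrary ids
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)   # 0, 1, 2, ...
token_type_ids = torch.zeros_like(input_ids)

embeddings = (word_embeddings(input_ids)
              + position_embeddings(position_ids)
              + token_type_embeddings(token_type_ids))
embeddings = layer_norm(embeddings)  # this is all the encoder layers ever see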
Would I get better results if the output of the previous layer were positionally encoded again before being fed into the next BERT layer?
Upvotes: 7
Views: 10602
Reputation: 13
@Shai's answer is quite wonderful. I had the same question, and his answer helped me a lot. I'd like to add another nice paper on this topic that provides deeper insight into position encoding: Conditional Positional Encodings for Vision Transformers (arXiv 2021).
In Section 4.2, Table 2, CPVT-Ti plus shows better performance than CPVT-Ti. CPVT-Ti plus has the positional embedding inserted before each of the first five encoder blocks (instead of only the first encoder, as in CPVT-Ti). This suggests that your guess ("Would I get better results if the output of the previous layer were positionally encoded again before the next BERT layer?") is probably right.
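For illustration, here is a toy sketch of that idea (this is not CPVT's actual code; the class and parameter names are made up), in which a learnable positional embedding is re-added before each of the first k encoder layers:

import torch
import torch.nn as nn

class ReinjectedPositionEncoder(nn.Module):
    # Toy encoder stack that re-adds a learnable positional embedding
    # before each of the first `k` layers (hypothetical, not CPVT code).
    def __init__(self, num_layers=12, k=5, seq_len=512, dim=768, heads=12):
        super().__init__()
        self.k = k
        self.pos = nn.Parameter(torch.zeros(1, seq_len, dim))
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            for _ in range(num_layers)
        ])

    def forward(self, x):                        # x: (batch, seq_len, dim)
        for i, layer in enumerate(self.layers):
            if i < self.k:                       # re-inject position info
                x = x + self.pos[:, :x.size(1)]
            x = layer(x)
        return x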
Upvotes: 1
Reputation: 114986
What is the purpose of positional embeddings?
In transformers (BERT included) the only interaction between the different tokens is done via self-attention layers. If you look closely at the mathematical operation implemented by these layers you will notice that these layers are permutation equivariant: That is, the representation of
"I do like coding"
and
"Do I like coding"
is the same (up to a reordering of the token representations), because the words (= tokens) are the same in both sentences, only their order is different.
As you can see, this "permutation equivariance" is not a desired property in many cases.
To break this symmetry/equivariance one can simply "code" the actual position of each word/token in the sentence. For example:
"I_1 do_2 like_3 coding_4"
is no longer identical to
"Do_1 I_2 like_3 coding_4"
This is the purpose of positional encoding/embeddings -- to make self-attention layers sensitive to the order of the tokens.
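You can check this numerically with a small self-contained sketch (not BERT code): permuting the input tokens permutes the self-attention output in exactly the same way, while adding a position code breaks that symmetry, mirroring the "I do" / "Do I" example above.

import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)

x = torch.randn(1, 4, 16)            # 4 "tokens"
perm = torch.tensor([1, 0, 2, 3])    # swap the first two tokens ("I do" -> "Do I")
x_perm = x[:, perm]

out, _ = attn(x, x, x)
out_perm, _ = attn(x_perm, x_perm, x_perm)
print(torch.allclose(out[:, perm], out_perm, atol=1e-6))   # True: permutation equivariant

pos = torch.randn(1, 4, 16)          # a position code (could be an nn.Parameter)
out_pos, _ = attn(x + pos, x + pos, x + pos)
out_pos_perm, _ = attn(x_perm + pos, x_perm + pos, x_perm + pos)
print(torch.allclose(out_pos[:, perm], out_pos_perm, atol=1e-6))  # False: order now matters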
Now to your questions:
nn.Parameter
Yes, the learnable position encoding is indeed implemented with a simple nn.Parameter. The position encoding is just a "code" added to each token, marking its position in the sequence. Therefore, all it requires is a tensor of the same size as the input sequence, with different values per position.
Upvotes: 16