Reputation: 7254
In several academic papers, researchers use the following (sinusoidal) positional encoding to denote the position of elements in a sequence, whether it is a time-series sequence or words in a sentence for NLP purposes:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

My question is how this positional encoding is actually applied to the data before it is fed to the deep neural network (in my case a transformer network):

1. Are the positional values added directly to the actual values of the elements in the sequence (or to the word representation values)? Or are they concatenated? Is the positional embedding part of the data preprocessing stage?
2. Does the TensorFlow/Keras MultiHeadAttention layer actually already contain an Embedding layer that takes care of the positional encoding? Or not?
3. What about the normalization of the data? Are only the actual element values normalized and then the positional encoding added to that normalized value? Or is the positional encoding added to the raw value of the element and the resulting values normalized?

I am interested in actual implementation details, not the conceptual part of positional encoding, as I have already read most of the academic papers on it. Unfortunately, most papers fall short of describing in detail at what stage and how exactly the positional encoding is applied to the data structure.
Thanks!!!
Upvotes: 3
Views: 7973
Reputation: 499
Positional encoding is just a way to let the model differentiate two elements (words) that are the same but appear at different positions in a sequence.
After applying the embeddings in a language model (LM), for example, we add the PE to inject information about the position of each word.
Are the positional values added directly to the actual values of the elements in the sequence (or to the word representation values)? Or are they concatenated? Is the positional embedding part of the data preprocessing stage?
Yes, the PE values are just added directly to the actual values (the embeddings, in an LM). As a result, the embedding vector of the word a that appears at the beginning of the sequence will be different from the embedding vector of the same word that appears in the middle of the sequence. And no, the PE is not part of the data preprocessing stage.
Here's an example implementation (in PyTorch):
import torch
import torch.nn as nn


class PositionalEncodingLayer(nn.Module):

    def __init__(self, d_model, max_len=100):
        super(PositionalEncodingLayer, self).__init__()
        self.d_model = d_model
        self.max_len = max_len

    def get_angles(self, positions, indexes):
        d_model_tensor = torch.FloatTensor([[self.d_model]]).to(positions.device)
        angle_rates = torch.pow(10000, (2 * (indexes // 2)) / d_model_tensor)
        return positions / angle_rates

    def forward(self, input_sequences):
        """
        :param Tensor[batch_size, seq_len] input_sequences
        :return Tensor[batch_size, seq_len, d_model] position_encoding
        """
        positions = torch.arange(input_sequences.size(1)).unsqueeze(1).to(input_sequences.device)  # [seq_len, 1]
        indexes = torch.arange(self.d_model).unsqueeze(0).to(input_sequences.device)               # [1, d_model]
        angles = self.get_angles(positions, indexes)                                               # [seq_len, d_model]
        angles[:, 0::2] = torch.sin(angles[:, 0::2])  # apply sin to even indices in the tensor; 2i
        angles[:, 1::2] = torch.cos(angles[:, 1::2])  # apply cos to odd indices in the tensor; 2i+1
        position_encoding = angles.unsqueeze(0).repeat(input_sequences.size(0), 1, 1)              # [batch_size, seq_len, d_model]
        return position_encoding


class InputEmbeddingAndPositionalEncodingLayer(nn.Module):

    def __init__(self, vocab_size, max_len, d_model, dropout):
        super(InputEmbeddingAndPositionalEncodingLayer, self).__init__()
        self.vocab_size = vocab_size
        self.max_len = max_len
        self.d_model = d_model
        self.dropout = nn.Dropout(p=dropout)
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_encoding = PositionalEncodingLayer(d_model=d_model, max_len=max_len)

    def forward(self, sequences):
        """
        :param Tensor[batch_size, seq_len] sequences
        :return Tensor[batch_size, seq_len, d_model]
        """
        token_embedded = self.token_embedding(sequences)        # [batch_size, seq_len, d_model]
        position_encoded = self.position_encoding(sequences)    # [batch_size, seq_len, d_model]
        return self.dropout(token_embedded) + position_encoded  # [batch_size, seq_len, d_model]
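For example, the combined layer above can be used like this (the vocabulary size, sequence length and model dimension below are made-up values, just to show the shapes):

import torch

# Hypothetical hyperparameters, for illustration only
embedder = InputEmbeddingAndPositionalEncodingLayer(vocab_size=1000, max_len=50, d_model=64, dropout=0.1)
token_ids = torch.randint(0, 1000, (8, 50))   # [batch_size=8, seq_len=50] integer token ids
embedded = embedder(token_ids)                # token embeddings + positional encoding
print(embedded.shape)                         # torch.Size([8, 50, 64])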
Does the TensorFlow/Keras MultiHeadAttention layer actually already contain an Embedding layer that takes care of the positional encoding? Or not?
Simply, no. You have to build the PE yourself.
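To illustrate, here is a minimal Keras sketch with made-up sizes; it uses a learned position Embedding instead of the sinusoidal PE above, simply to show that the positional information has to be injected before MultiHeadAttention ever sees the data:

import tensorflow as tf

# Made-up sizes, for illustration only
batch_size, seq_len, vocab_size, d_model = 8, 50, 1000, 64

token_ids = tf.random.uniform((batch_size, seq_len), maxval=vocab_size, dtype=tf.int32)

token_emb = tf.keras.layers.Embedding(vocab_size, d_model)              # token embeddings only
pos_emb = tf.keras.layers.Embedding(seq_len, d_model)                   # learned positional embedding
mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=d_model)

x = token_emb(token_ids)               # [batch_size, seq_len, d_model]
# MultiHeadAttention has no notion of order, so position information is added by hand:
x = x + pos_emb(tf.range(seq_len))     # broadcasts over the batch dimension
output = mha(x, x)                     # self-attention over the position-aware inputs
print(output.shape)                    # (8, 50, 64)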
What about the normalization of the data? Are only the actual element values normalized and then the positional encoding added to that normalized value? Or is the positional encoding added to the raw value of the element and the resulting values normalized?
The normalization part is at your discretion; you can do it however you want, but you should apply some normalization, and the PE is added to the normalized values, not the raw ones.
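As a sketch of one possible ordering for, say, time-series inputs (assuming the raw features already have d_model dimensions, reusing the PositionalEncodingLayer defined above, and using LayerNorm as just one choice of normalization):

import torch
import torch.nn as nn

# Made-up shapes, for illustration only
batch_size, seq_len, d_model = 8, 20, 16
raw_values = torch.randn(batch_size, seq_len, d_model)   # raw element values
norm = nn.LayerNorm(d_model)

pe_layer = PositionalEncodingLayer(d_model=d_model)
pe = pe_layer(torch.zeros(batch_size, seq_len, dtype=torch.long))  # only the shape/device is used here

x = norm(raw_values) + pe   # normalize first, then add the positional encoding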
Upvotes: 4