Reputation: 7254
In several academic papers, researchers use the following (sinusoidal) positional encoding to denote the position of elements in a sequence, whether it is a time-series sequence or words in a sentence for NLP purposes:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

My question is how this positional encoding is actually applied to the data before it is fed to the deep neural network (in my case a transformer network):

1. Are the positional values added directly to the actual values of the elements in the sequence (or to the word representation values)? Or are they concatenated? Is the positional embedding part of the data preprocessing stage?
2. Does the TensorFlow/Keras MultiHeadAttention layer actually already contain an Embedding layer that takes care of the positional encoding? Or not?
3. What about the normalization of the data? Are only the actual element values normalized and then the positional encoding added to that normalized value? Or is the positional encoding added to the raw value of the element and the resulting values normalized?

I am interested in actual implementation details, not the conceptual part of positional encoding, as I have already read most of the academic papers on it. Unfortunately, most papers fall short of describing in detail at what stage and how exactly the positional encoding is applied to the data structure.
Thanks!!!
Upvotes: 3
Views: 7973
Reputation: 499
Positional encoding is just a way to let the model differentiate two elements (words) that are the same but appear at different positions in a sequence.
After applying the embeddings in a language model (LM), for example, we add the PE to inject information about the position of each word.
Are the positional values added directly to the actual values of the elements in the sequence (or to the word representation values)? Or are they concatenated? Is the positional embedding part of the data preprocessing stage?
Yes, the PE values are just added directly to the actual values (the embeddings, in an LM). As a result, the embedding vector of the word a that appears at the beginning of the sequence will be different from the embedding vector of the same word that appears in the middle of the sequence. And no, the PE is not part of the data preprocessing stage.
Here's an example implementation (in PyTorch):
import torch
import torch.nn as nn


class PositionalEncodingLayer(nn.Module):

    def __init__(self, d_model, max_len=100):
        super(PositionalEncodingLayer, self).__init__()
        self.d_model = d_model
        self.max_len = max_len

    def get_angles(self, positions, indexes):
        d_model_tensor = torch.FloatTensor([[self.d_model]]).to(positions.device)
        angle_rates = torch.pow(10000, (2 * (indexes // 2)) / d_model_tensor)
        return positions / angle_rates

    def forward(self, input_sequences):
        """
        :param Tensor[batch_size, seq_len] input_sequences
        :return Tensor[batch_size, seq_len, d_model] position_encoding
        """
        positions = torch.arange(input_sequences.size(1)).unsqueeze(1).to(input_sequences.device)  # [seq_len, 1]
        indexes = torch.arange(self.d_model).unsqueeze(0).to(input_sequences.device)               # [1, d_model]
        angles = self.get_angles(positions, indexes)                                               # [seq_len, d_model]
        angles[:, 0::2] = torch.sin(angles[:, 0::2])  # apply sin to even indices in the tensor; 2i
        angles[:, 1::2] = torch.cos(angles[:, 1::2])  # apply cos to odd indices in the tensor; 2i+1
        position_encoding = angles.unsqueeze(0).repeat(input_sequences.size(0), 1, 1)              # [batch_size, seq_len, d_model]
        return position_encoding


class InputEmbeddingAndPositionalEncodingLayer(nn.Module):

    def __init__(self, vocab_size, max_len, d_model, dropout):
        super(InputEmbeddingAndPositionalEncodingLayer, self).__init__()
        self.vocab_size = vocab_size
        self.max_len = max_len
        self.d_model = d_model
        self.dropout = nn.Dropout(p=dropout)
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_encoding = PositionalEncodingLayer(d_model=d_model, max_len=max_len)

    def forward(self, sequences):
        """
        :param Tensor[batch_size, seq_len] sequences
        :return Tensor[batch_size, seq_len, d_model]
        """
        token_embedded = self.token_embedding(sequences)        # [batch_size, seq_len, d_model]
        position_encoded = self.position_encoding(sequences)    # [batch_size, seq_len, d_model]
        return self.dropout(token_embedded) + position_encoded  # [batch_size, seq_len, d_model]
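For example, the combined layer above can be used like this (the vocabulary size, sequence length and model dimension below are made-up values, just to show the shapes):

import torch

# Hypothetical hyperparameters, for illustration only
embedder = InputEmbeddingAndPositionalEncodingLayer(vocab_size=1000, max_len=50, d_model=64, dropout=0.1)
token_ids = torch.randint(0, 1000, (8, 50))   # [batch_size=8, seq_len=50] integer token ids
embedded = embedder(token_ids)                # token embeddings + positional encoding
print(embedded.shape)                         # torch.Size([8, 50, 64])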
Does the TensorFlow/Keras MultiHeadAttention layer actually already contain an Embedding layer that takes care of the positional encoding? Or not?
Simply, no. You have to build the PE yourself.
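To illustrate, here is a minimal Keras sketch with made-up sizes; it uses a learned position Embedding instead of the sinusoidal PE above, simply to show that the positional information has to be injected before MultiHeadAttention ever sees the data:

import tensorflow as tf

# Made-up sizes, for illustration only
batch_size, seq_len, vocab_size, d_model = 8, 50, 1000, 64

token_ids = tf.random.uniform((batch_size, seq_len), maxval=vocab_size, dtype=tf.int32)

token_emb = tf.keras.layers.Embedding(vocab_size, d_model)              # token embeddings only
pos_emb = tf.keras.layers.Embedding(seq_len, d_model)                   # learned positional embedding
mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=d_model)

x = token_emb(token_ids)               # [batch_size, seq_len, d_model]
# MultiHeadAttention has no notion of order, so position information is added by hand:
x = x + pos_emb(tf.range(seq_len))     # broadcasts over the batch dimension
output = mha(x, x)                     # self-attention over the position-aware inputs
print(output.shape)                    # (8, 50, 64)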
What about the normalization of the data? Are only the actual element values normalized and then the positional encoding added to that normalized value? Or is the positional encoding added to the raw value of the element and the resulting values normalized?
The normalization part is at your discretion; you can do it however you want, but you should apply some normalization, and the PE is added to the normalized values, not the raw ones.
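As a sketch of one possible ordering for, say, time-series inputs (assuming the raw features already have d_model dimensions, reusing the PositionalEncodingLayer defined above, and using LayerNorm as just one choice of normalization):

import torch
import torch.nn as nn

# Made-up shapes, for illustration only
batch_size, seq_len, d_model = 8, 20, 16
raw_values = torch.randn(batch_size, seq_len, d_model)   # raw element values
norm = nn.LayerNorm(d_model)

pe_layer = PositionalEncodingLayer(d_model=d_model)
pe = pe_layer(torch.zeros(batch_size, seq_len, dtype=torch.long))  # only the shape/device is used here

x = norm(raw_values) + pe   # normalize first, then add the positional encoding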
Upvotes: 4