Reputation: 715
How exactly is this positional encoding calculated?
Let's assume a machine translation scenario, and these are the input sentences:
english_text = [this is good, this is bad]
german_text = [das ist gut, das ist schlecht]
Now our input vocabulary size is 4 and embedding dimension is 4.
#words #embeddings
this - [0.5, 0.2, 0.3, 0.1]
is - [0.1, 0.2, 0.5, 0.1]
good - [0.9, 0.7, 0.9, 0.1]
bad - [0.7, 0.3, 0.4, 0.1]
As per the Transformer paper, we add each word's positional encoding to its word embedding and then pass the result to the encoder, as seen in the image below.
As far as the paper is concerned, this is the formula they give for calculating the positional encoding of each word.
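For reference, the formula given in the paper ("Attention Is All You Need", Vaswani et al., 2017) is:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

where pos is the position of the token in the sentence and i indexes the pairs of embedding dimensions.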
So, this is how I think I can implement it:
import numpy as np

d_model = 4              # embedding dimension
max_sentence_length = 3  # as per my examples above

positional_embeddings = np.zeros((max_sentence_length, d_model))

for position in range(max_sentence_length):
    for i in range(0, d_model, 2):
        # i is already the even dimension index ("2i" in the paper), so the
        # exponent is i / d_model, and the sin/cos pair shares the same denominator
        positional_embeddings[position, i] = np.sin(
            position / (10000 ** (i / d_model))
        )
        positional_embeddings[position, i + 1] = np.cos(
            position / (10000 ** (i / d_model))
        )
Then, the new embedding vectors for the first sentence ("this is good") will be
[[0.5, 0.2, 0.3, 0.1],
[0.1, 0.2, 0.5, 0.1],
[0.9, 0.7, 0.9, 0.1]] + positional_embeddings = NEW EMBEDDINGS
## shapes
3 x 4 + 3 x 4 = 3 x 4
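For example, continuing with the positional_embeddings computed above and the embeddings of the first sentence:

word_embeddings = np.array([[0.5, 0.2, 0.3, 0.1],   # this
                            [0.1, 0.2, 0.5, 0.1],   # is
                            [0.9, 0.7, 0.9, 0.1]])  # good

new_embeddings = word_embeddings + positional_embeddings  # element-wise, both (3, 4)
print(new_embeddings.shape)  # (3, 4)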
Is this how the calculation is carried out in actual implementations? Please correct me if there's any mistake in my pseudo-implementation above.
If everything is correct, then I have three doubts I hope someone can clear up:
1) In the above implementation we use the sine formula for even dimension indices and the cosine formula for odd dimension indices, but I couldn't understand the reason behind it. I read that it makes use of cyclic properties, but I couldn't understand them.
2) Is there a reason behind choosing 10000^(2i / d_model) as the scaling factor in the formula?
3) Not all sentences will be of the maximum sentence length, so we might have to pad them. Do we also calculate positional encodings for the padding tokens?
Upvotes: 4
Views: 4488
Reputation: 11240
Your implementation is basically correct. The typical implementation pre-computes the embedding matrix, makes a non-trainable embedding layer out of it, and does an embedding lookup of a range of positions. See e.g. the implementation in HuggingFace's Transformers.
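A minimal sketch of that pattern (not HuggingFace's actual code; build_sinusoidal_table is a hypothetical helper name) could look like this:

import torch
import torch.nn as nn

def build_sinusoidal_table(max_len, d_model):
    # Pre-compute the sinusoidal table once, outside the training loop.
    table = torch.zeros(max_len, d_model)
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    div_term = 10000 ** (torch.arange(0, d_model, 2, dtype=torch.float32) / d_model)
    table[:, 0::2] = torch.sin(position / div_term)
    table[:, 1::2] = torch.cos(position / div_term)
    return table

# Wrap the table in a frozen (non-trainable) embedding layer ...
pos_embedding = nn.Embedding.from_pretrained(build_sinusoidal_table(512, 4), freeze=True)

# ... and look the positions up as a range of indices.
positions = torch.arange(3)            # [0, 1, 2]
print(pos_embedding(positions).shape)  # torch.Size([3, 4]), added to the word embeddings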
Some hints about the intuition behind the equations are in these threads:
But it seems to me that pretty much all decisions about the position encoding were empirical choices.
By cyclic properties, they IMHO mean that, given a dimension of the embedding, the difference between the embedding values at positions with a constant offset is the same regardless of the position in the sequence. For that, using only sine or only cosine might be enough, but some positions would have a much larger norm than others, therefore they alternate sine and cosine.
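Concretely, for a single sine/cosine pair with frequency $\omega_i = 1 / 10000^{2i/d_{model}}$, the angle-addition identities give

$$\begin{pmatrix} \sin(\omega_i (p+k)) \\ \cos(\omega_i (p+k)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix} \begin{pmatrix} \sin(\omega_i p) \\ \cos(\omega_i p) \end{pmatrix},$$

i.e. the encoding at position p + k is a fixed rotation of the encoding at position p that depends only on the offset k, and every sin/cos pair keeps a constant norm of 1.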
I think the scaling factors are empirically estimated to cover the usual length of sentences.
With padding, you indeed also compute the positional encoding of the padded positions, but since the encodings are pre-computed, this does not mean a higher computation load: you get the embeddings for the padding symbols anyway.
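As a toy illustration (assuming the positional_embeddings matrix from the question, a hypothetical shorter sentence "this is" padded to length 3, and a zero vector for the <pad> embedding):

sent1 = np.array([[0.5, 0.2, 0.3, 0.1],    # this is good
                  [0.1, 0.2, 0.5, 0.1],
                  [0.9, 0.7, 0.9, 0.1]])
sent2 = np.array([[0.5, 0.2, 0.3, 0.1],    # "this is", padded with a zero <pad> embedding
                  [0.1, 0.2, 0.5, 0.1],
                  [0.0, 0.0, 0.0, 0.0]])
batch = np.stack([sent1, sent2])           # shape (2, 3, 4)

# Positional encodings are added at every position, pads included;
# pad positions are ignored later via the attention mask, not at this step.
batch_with_positions = batch + positional_embeddings  # broadcasts over the batch dimension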
Upvotes: 4