Reputation: 919
In Attention Is All You Need, the authors implement a positional embedding (which adds information about where a word is in a sequence). For this, they use a sinusoidal embedding:
PE(pos,2i) = sin(pos/10000**(2*i/hidden_units))
PE(pos,2i+1) = cos(pos/10000**(2*i/hidden_units))
where pos is the position and i is the dimension. It must result in an embedding matrix of shape [max_length, embedding_size], i.e., given a position in a sequence, it returns the tensor of PE[position,:].
I found Kyubyong's implementation, but I do not fully understand it.
I tried to implement it in numpy the following way:
import numpy as np
import matplotlib.pyplot as plt

hidden_units = 100  # Dimension of embedding
vocab_size = 10     # Maximum sentence length
# Matrix of [[0, ..., 99], [0, ..., 99], ...]
i = np.tile(np.expand_dims(range(hidden_units), 0), [vocab_size, 1])
# Matrix of [[0, ..., 0], [1, ..., 1], ...]
pos = np.tile(np.expand_dims(range(vocab_size), 1), [1, hidden_units])
# Apply the intermediate functions
pos = np.multiply(pos, 1/10000.0)
i = np.multiply(i, 2.0/hidden_units)
matrix = np.power(pos, i)
# Apply the sine function to the even columns
matrix[:, 1::2] = np.sin(matrix[:, 1::2])  # even
# Apply the cosine function to the odd columns
matrix[:, ::2] = np.cos(matrix[:, ::2])    # odd
# Plot
im = plt.imshow(matrix, cmap='hot', aspect='auto')
I don't understand how this matrix can give information on the position of inputs. Could someone first tell me if this is the right way to compute it and second what is the rationale behind it?
Thank you.
Upvotes: 10
Views: 13719
Reputation: 66
Disclaimer: I have not read the paper. I just came across this formula from a colleague, and it got me thinking.
Reproducing the formulas from the question here for your convenience:
PE(pos,2i) = sin(pos/10000**(2*i/hidden_units))
PE(pos,2i+1) = cos(pos/10000**(2*i/hidden_units))
For position 0 (pos=0), we have alternating SIN(0) and COS(0) as the position embedding, i.e., for position 0 the scheme just differentiates the odd and even dimensions of the position embedding.
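A quick sanity check of this (my own minimal numpy sketch, assuming an 8-dimensional embedding):

import numpy as np

hidden_units = 8
# For pos = 0 the argument of every term is 0, so the embedding alternates sin(0) = 0 and cos(0) = 1
pe_pos0 = np.array([np.sin(0) if d % 2 == 0 else np.cos(0) for d in range(hidden_units)])
print(pe_pos0)  # [0. 1. 0. 1. 0. 1. 0. 1.]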
Before looking at other positions: observe the denominator term 10000**(2*i/hidden_units). Its exponent grows from 0 (i=0) to 2 (i=hidden_units), so it is a power of 10000 ranging from 1 up to 10000**2. Now consider the case i >= (hidden_units/2): for such an i the exponent is at least 1, so the denominator is at least 10000. If "pos <<< 10000", the argument is then close to 0 and the case degenerates to SIN(0) and COS(0). Thus, for the higher dimensions of the position embedding (regardless of pos), this scheme really does not differentiate much between successive dimensions, except possibly that they are odd/even dimensions.
This also opens our eyes to another way of looking at position embedding. We now know that the lower dimensions of the position embedding are more sensitive to "pos" than the higher dimensions of the position embedding. So, it might be worthwhile to look at how a dimension of the position embedding is changing with respect to different positions.
We can look at the SIN and COS functions as SIN(wt) and COS(wt), where "pos" plays the role of "t" and w = 1/10000**(2*i/hidden_units) is the angular frequency. One can observe that as i ranges from 0 to hidden_units, the frequency w decreases from 1 to a very small number. This is in line with the observation above that the higher dimensions of the embedding are not sensitive to "pos" (they are lower-frequency waves).
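To see this fall-off concretely, here is a minimal sketch of w = 1/10000**(2*i/hidden_units) over the dimensions (my own example, assuming hidden_units = 512):

import numpy as np

hidden_units = 512
i = np.arange(hidden_units)
w = 1.0 / 10000 ** (2 * i / hidden_units)   # angular frequency used in dimension i
print(w[0], w[hidden_units // 2], w[-1])    # 1.0, 1e-4, ~1e-8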
TL;DR
We can now see that the dimensions of the embedding are samplings of sinusoidal waves of decreasing frequency.
How does this encode position?
A sinusoidal wave varies smoothly at a rate set by its frequency. By using waves of different frequencies, we encode different notions of positional similarity in different dimensions. This allows the network to pick the similarity it needs from whichever dimension it sees fit for a given word and sentence.
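One way to make this concrete is to build the full PE matrix from the formula above and compare positions by dot product (my own sketch; positional_encoding is a hypothetical helper, not from the paper or the question):

import numpy as np

def positional_encoding(max_len, hidden_units):
    # PE[pos, 2i]   = sin(pos / 10000**(2*i/hidden_units))
    # PE[pos, 2i+1] = cos(pos / 10000**(2*i/hidden_units))
    pos = np.arange(max_len)[:, None]
    i = np.arange(hidden_units // 2)[None, :]
    angles = pos / 10000 ** (2 * i / hidden_units)
    pe = np.zeros((max_len, hidden_units))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(100, 64)
# Nearby positions end up with more similar vectors than distant ones
print(pe[10] @ pe[11])  # relatively large dot product
print(pe[10] @ pe[50])  # smaller dot product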
What's up with the SIN and COS terms? I believe they are there to induce some variance. Perhaps one of you could correct me and clarify.
What if pos is close to hidden_units? Well, these are now hyper-parameters of your model. I think the scheme allows us to experiment and find out what works best. Perhaps one of you could exposit better.
Upvotes: 0
Reputation: 919
I found the answer in a PyTorch implementation:
import numpy as np
import torch

def position_encoding_init(n_position, d_pos_vec):
    # keep dim 0 for padding token position encoding zero vector
    position_enc = np.array([
        [pos / np.power(10000, 2*i/d_pos_vec) for i in range(d_pos_vec)]
        if pos != 0 else np.zeros(d_pos_vec) for pos in range(n_position)])
    position_enc[1:, 0::2] = np.sin(position_enc[1:, 0::2])  # dim 2i
    position_enc[1:, 1::2] = np.cos(position_enc[1:, 1::2])  # dim 2i+1
    return torch.from_numpy(position_enc).type(torch.FloatTensor)
where d_pos_vec is the embedding dimension and n_position the max sequence length.
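A quick usage sketch (assuming the snippet is wrapped in a function named position_encoding_init as above): row pos of the returned table is PE[pos, :], and row 0 is reserved as the zero vector for the padding token.

pe_table = position_encoding_init(n_position=50, d_pos_vec=512)
print(pe_table.shape)  # torch.Size([50, 512])
print(pe_table[0])     # all zeros: the padding-token position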
EDIT:
In the paper, the authors say that this representation of the embedding matrix allows "the model to extrapolate to sequence lengths longer than the ones encountered during training".
The only difference between two positions is the pos variable. Check the image below for a graphical representation.
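Such a heatmap can be reproduced with matplotlib (a sketch reusing position_encoding_init from the snippet above):

import matplotlib.pyplot as plt

pe_table = position_encoding_init(n_position=100, d_pos_vec=512).numpy()
plt.imshow(pe_table, cmap='hot', aspect='auto')
plt.xlabel('embedding dimension')
plt.ylabel('position')
plt.show()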
Upvotes: 11