Reputation: 387
I don't understand the position embedding in the paper Convolutional Sequence to Sequence Learning. Can anyone help me?
Upvotes: 8
Views: 6445
Reputation: 1315
I think khaemuaset's answer is correct.
To reinforce: as I understand from the paper (I'm reading A Convolutional Encoder Model for Machine Translation) and the corresponding Facebook AI Research PyTorch source code, the position embedding is a typical embedding table, but indexed by sequence-position one-hot vectors instead of vocabulary one-hot vectors. I verified this with the source code here. Notice the inheritance from nn.Embedding and the call to its forward method at line 32.
The class I linked to is used in the FConvEncoder here.
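For illustration, here is a minimal sketch of that idea (not the fairseq code itself; the class name SimplePositionalEmbedding and the sizes are made up): an embedding table that subclasses nn.Embedding and whose forward builds position indices from the token batch before calling the parent's forward.

```python
import torch
import torch.nn as nn

class SimplePositionalEmbedding(nn.Embedding):
    """Embedding table indexed by sequence position instead of token id.

    A simplified sketch of the idea: inherit from nn.Embedding and call its
    forward() with position indices 0..seq_len-1 built from the input batch.
    (The real fairseq class also handles padding offsets, omitted here.)
    """

    def forward(self, tokens):
        # tokens: (batch, seq_len) of token ids; only the shape is used here.
        batch_size, seq_len = tokens.size()
        positions = torch.arange(seq_len, device=tokens.device)
        positions = positions.unsqueeze(0).expand(batch_size, seq_len)
        return super().forward(positions)

pos_emb = SimplePositionalEmbedding(num_embeddings=1024, embedding_dim=256)
dummy_tokens = torch.randint(0, 100, (2, 7))  # batch of 2 sentences, length 7
print(pos_emb(dummy_tokens).shape)            # torch.Size([2, 7, 256])
```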
Upvotes: 0
Reputation: 16
From what I have understood so far, for each word in the sentence they have two vectors: the word embedding and the position embedding (the embedding of the word's position in the sentence).
Both of these vectors are passed as input separately, and each is embedded into an f-dimensional space. Once they have the activation values from both inputs, each of which ∈ R^f, they simply add these activations to obtain a combined input element representation.
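To make the "just add these activations" step concrete, here is a toy example (made-up numbers, f = 4) of the combined representation e_j = w_j + p_j:

```python
import torch

# Hypothetical activations for one input element, both in R^f with f = 4.
w_j = torch.tensor([0.2, -1.0, 0.5, 0.3])   # word embedding of the j-th token
p_j = torch.tensor([0.1,  0.4, -0.2, 0.0])  # embedding of position j
e_j = w_j + p_j                             # combined input element representation
print(e_j)                                  # tensor([ 0.3000, -0.6000,  0.3000,  0.3000])
```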
Upvotes: 0
Reputation: 1
From what I perceive, the position embedding is still the usual procedure of building a low-dimensional representation from one-hot vectors; this time, however, the dimension of the one-hot vector is the length of the sentence rather than the vocabulary size. By the way, I think whether the 'one-hot' vectors are assigned in position order doesn't really matter; it just gives the model a sense of 'position awareness'.
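To illustrate that view, here is a small check (toy sizes of my own choosing) that an embedding lookup by position index is the same as multiplying the one-hot position vector by the embedding weight matrix:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

max_len, dim = 10, 4                      # toy sentence length and embedding size
pos_emb = nn.Embedding(max_len, dim)      # table: one row per position

position = torch.tensor([3])              # the 4th position in the sentence
one_hot = F.one_hot(position, num_classes=max_len).float()  # shape (1, 10)

by_lookup = pos_emb(position)             # direct index into the table
by_matmul = one_hot @ pos_emb.weight      # one-hot vector times weight matrix
print(torch.allclose(by_lookup, by_matmul))  # True
```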
Upvotes: 0
Reputation: 227
From what I understand, for each word to translate, the input contains both the word itself and its position in the input sequence (say, 0, 1, ..., m).
Now, encoding such data by simply having a single cell with the value pos (in 0..m) would not perform very well (for the same reason we use one-hot vectors rather than raw indices to encode words). So, basically, the position will be encoded across a number of input cells, with a one-hot representation (or something similar; I could imagine a binary representation of the position being used).
Then, an embedding layer will be used (just as it is used for word encodings) to transform this sparse and discrete representation into a continuous one.
The representation used in the paper gives the word embedding and the position embedding the same dimension and simply sums up the two, as in the sketch below.
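A minimal sketch of that construction (toy sizes of my own choosing, not the paper's values) could look like this:

```python
import torch
import torch.nn as nn

vocab_size, max_positions, dim = 1000, 50, 8   # toy sizes
word_emb = nn.Embedding(vocab_size, dim)       # word one-hot     -> R^dim
pos_emb  = nn.Embedding(max_positions, dim)    # position one-hot -> R^dim (same dim)

tokens = torch.tensor([[5, 42, 7, 0]])                  # one sentence, length m = 4
positions = torch.arange(tokens.size(1)).unsqueeze(0)   # [[0, 1, 2, 3]]

inputs = word_emb(tokens) + pos_emb(positions)  # element-wise sum of the two embeddings
print(inputs.shape)                             # torch.Size([1, 4, 8])
```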
Upvotes: 2