高源伯

Reputation: 387

What is the position embedding in the convolutional sequence to sequence learning model?

I don't understand the position embedding in the paper Convolutional Sequence to Sequence Learning. Can anyone help me?

Upvotes: 8

Views: 6445

Answers (4)

Dylan F

Reputation: 1315

I think khaemuaset's answer is correct.

To reinforce: As I understand from the paper (I'm reading A Convolutional Encoder Model for Machine Translation) and the corresponding Facebook AI Research PyTorch source code, the position embedding is a typical embedding table, but for seq position one-hot vectors instead of vocab one-hot vectors. I verified this with the source code here. Notice the inheritance of nn.Embedding and the call to its forward method at line 32.

The class I linked to is used in the FConvEncoder here.
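In case a concrete picture helps, here is a minimal, hypothetical sketch (my own illustration, not the actual fairseq class) of a learned position embedding built on nn.Embedding: the table is indexed by sequence positions rather than vocabulary ids.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: a learned position embedding is an ordinary
# nn.Embedding whose indices are sequence positions 0..seq_len-1
# instead of word ids.
class LearnedPositionalEmbedding(nn.Embedding):
    def forward(self, tokens):
        # tokens: (batch, seq_len) of word ids; only their shape is used here.
        batch, seq_len = tokens.shape
        positions = torch.arange(seq_len, device=tokens.device)   # 0, 1, ..., seq_len-1
        positions = positions.unsqueeze(0).expand(batch, seq_len)
        return super().forward(positions)                          # (batch, seq_len, embed_dim)

pos_emb = LearnedPositionalEmbedding(num_embeddings=1024, embedding_dim=256)
tokens = torch.randint(0, 10000, (2, 7))
print(pos_emb(tokens).shape)   # torch.Size([2, 7, 256])
```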

Upvotes: 0

Aman Gokrani

Reputation: 16

From what I have understood so far, for each word in the sentence there are two vectors:

  1. A one-hot vector encoding the word.
  2. A one-hot vector encoding the position of the word in the sentence.

Both vectors are passed as separate inputs, and each is embedded into an f-dimensional space. Once the activation values for both inputs (each in R^f) are obtained, they are simply added to obtain the combined input element representation.
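A rough sketch of that combination (illustrative only; the sizes are made up). Both lookups map into the same f-dimensional space, so the combined representation is an element-wise sum:

```python
import torch
import torch.nn as nn

f = 256                                   # embedding dimension (R^f above)
vocab_size, max_positions = 10000, 1024

word_emb = nn.Embedding(vocab_size, f)    # embeds word one-hots
pos_emb = nn.Embedding(max_positions, f)  # embeds position one-hots

tokens = torch.randint(0, vocab_size, (2, 7))            # (batch, seq_len) word ids
positions = torch.arange(tokens.size(1)).expand_as(tokens)

combined = word_emb(tokens) + pos_emb(positions)          # (batch, seq_len, f)
print(combined.shape)
```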

Upvotes: 0

Haozhe Ji

Reputation: 1

As I see it, the position embedding is still the usual procedure of building a low-dimensional representation of one-hot vectors; the difference is that this time the dimension of the one-hot vector is the length of the sentence. By the way, I think the specific order in which the 'one hot' positions are assigned doesn't really matter; it just gives the model a sense of 'position awareness'.

Upvotes: 0

khaemuaset

Reputation: 227

From what I understand, for each word to translate, the input contains both the word itself and its position in the input sequence (say, 0, 1, ..., m).

Now, encoding such data by simply having a single cell with value pos (in 0..m) would not perform very well (for the same reason we use one-hot vectors to encode words). So, basically, the position is encoded across a number of input cells, with a one-hot representation (or something similar; a binary representation of the position could also be used).

Then, an embedding layer will be used (just as it is used for word encodings) to transform this sparse and discrete representation into a continuous one.

The paper chooses the same dimension for the word embedding and the position embedding and simply sums the two.
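To make the embedding-layer step concrete, here is a small illustrative check (not from the paper or its code) that multiplying a one-hot position vector by the embedding matrix is exactly a table lookup, which is why the sparse, discrete representation can be replaced by a dense, learned one:

```python
import torch
import torch.nn.functional as F

max_positions, dim = 10, 4
table = torch.nn.Embedding(max_positions, dim)   # learned position embedding table

pos = torch.tensor([3])                                      # position index
one_hot = F.one_hot(pos, num_classes=max_positions).float()  # sparse one-hot vector

via_matmul = one_hot @ table.weight              # (1, dim): one-hot times embedding matrix
via_lookup = table(pos)                          # (1, dim): direct table lookup
print(torch.allclose(via_matmul, via_lookup))    # True
```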

Upvotes: 2
