Reputation: 1514
I was wondering how useful the encoder's hidden states are in an attention network. When I looked into the structure of an attention model, I found that a model generally looks like this:
With a process like translation, why is it important for the encoder's hidden states to be fed forward, or to exist in the first place? We already know what the next x is going to be. The order of the input is therefore not necessarily important for the order of the output, and neither is what was memorized from the previous inputs, since the attention model looks at all inputs simultaneously. Couldn't you just apply attention directly to the embedding of x?
Thank you!
Upvotes: 0
Views: 216
Reputation: 11240
You can easily try it and see that you get quite bad results. Even if you added some positional encoding to the input embeddings, the results would still be pretty bad.
The order matters: sentences containing the same words in a different order can have an entirely different meaning (compare "dog bites man" with "man bites dog"). Also, the order is not the only information you get from the encoder. The encoder also performs input disambiguation: words can be homonymous, such as "train" (see https://arxiv.org/pdf/1908.11771.pdf). Probing of trained neural networks also shows that the encoder develops a fairly abstract representation of the input sentence (see https://arxiv.org/pdf/1911.00317.pdf), and a large part of the translation actually already happens in the encoder (see https://arxiv.org/pdf/2003.09586.pdf).
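For concreteness, below is a minimal sketch (PyTorch assumed, all class names and dimensions hypothetical) contrasting the question's proposal, attention directly over position-encoded source embeddings, with the usual setup where the attention keys and values come from encoder hidden states. In the first variant the word "train" gets the same vector in every context, which is exactly the disambiguation an encoder would otherwise provide.

```python
# Hedged sketch only: PyTorch assumed, names and dimensions are hypothetical.
import math
import torch
import torch.nn as nn


def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal positional encoding, shape (seq_len, d_model)."""
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe


class EmbeddingOnlyAttention(nn.Module):
    """The variant proposed in the question: the decoder attends directly to
    position-encoded source embeddings; no encoder contextualises them."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, src_tokens: torch.Tensor, decoder_query: torch.Tensor) -> torch.Tensor:
        # src_tokens: (batch, src_len); decoder_query: (batch, tgt_len, d_model)
        emb = self.embed(src_tokens) + positional_encoding(src_tokens.size(1), self.embed.embedding_dim)
        # Keys/values are context-free: each token's vector ignores its neighbours.
        context, _ = self.attn(decoder_query, emb, emb)
        return context


class EncoderAttention(nn.Module):
    """The usual setup: a (here recurrent) encoder produces context-dependent
    hidden states, and the attention keys/values come from those states."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, src_tokens: torch.Tensor, decoder_query: torch.Tensor) -> torch.Tensor:
        states, _ = self.encoder(self.embed(src_tokens))  # (batch, src_len, d_model)
        context, _ = self.attn(decoder_query, states, states)
        return context
```

In the second variant each hidden state already encodes the word's left context (word order, sense, agreement), so the attention mechanism retrieves disambiguated representations rather than raw lookup-table vectors.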
Upvotes: 1