Reputation: 1514
I was wondering how useful the encoder's hidden states are in an attention network. When I looked into the structure of an attention model, I found that a model generally looks like this:
With a process like translation, why is it important for the encoder's hidden states to be fed forward, or to exist in the first place? We already know what the next x is going to be. The order of the input is therefore not necessarily important for the order of the output, and neither is what was memorized from the previous inputs, since the attention model looks at all inputs simultaneously. Couldn't you just apply attention directly to the embedding of x?
Thank you!
Upvotes: 0
Views: 216
Reputation: 11240
You can easily try it and see that you get quite bad results. Even if you added some positional encoding to the input embeddings, the results would still be pretty bad.
The order matters: sentences containing the same words in a different order can have an entirely different meaning (compare "dog bites man" with "man bites dog"). Also, the order is not the only information you get from the encoder. The encoder also performs input disambiguation: words can be homonymous, such as "train" (see https://arxiv.org/pdf/1908.11771.pdf). Probing of trained neural networks also shows that the encoder develops a fairly abstract representation of the input sentence (see https://arxiv.org/pdf/1911.00317.pdf), and a large part of the translation actually already happens in the encoder (see https://arxiv.org/pdf/2003.09586.pdf).
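For concreteness, below is a minimal sketch (PyTorch assumed, all class names and dimensions hypothetical) contrasting the question's proposal, attention directly over position-encoded source embeddings, with the usual setup where the attention keys and values come from encoder hidden states. In the first variant the word "train" gets the same vector in every context, which is exactly the disambiguation an encoder would otherwise provide.

```python
# Hedged sketch only: PyTorch assumed, names and dimensions are hypothetical.
import math
import torch
import torch.nn as nn


def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal positional encoding, shape (seq_len, d_model)."""
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe


class EmbeddingOnlyAttention(nn.Module):
    """The variant proposed in the question: the decoder attends directly to
    position-encoded source embeddings; no encoder contextualises them."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, src_tokens: torch.Tensor, decoder_query: torch.Tensor) -> torch.Tensor:
        # src_tokens: (batch, src_len); decoder_query: (batch, tgt_len, d_model)
        emb = self.embed(src_tokens) + positional_encoding(src_tokens.size(1), self.embed.embedding_dim)
        # Keys/values are context-free: each token's vector ignores its neighbours.
        context, _ = self.attn(decoder_query, emb, emb)
        return context


class EncoderAttention(nn.Module):
    """The usual setup: a (here recurrent) encoder produces context-dependent
    hidden states, and the attention keys/values come from those states."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, src_tokens: torch.Tensor, decoder_query: torch.Tensor) -> torch.Tensor:
        states, _ = self.encoder(self.embed(src_tokens))  # (batch, src_len, d_model)
        context, _ = self.attn(decoder_query, states, states)
        return context
```

In the second variant each hidden state already encodes the word's left context (word order, sense, agreement), so the attention mechanism retrieves disambiguated representations rather than raw lookup-table vectors.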
Upvotes: 1