Reputation: 2877
Given the sentence "The animal didn't cross the street because it was too tired", how is self-attention able to assign a higher score to the word "animal" than to the word "street" (when attending from "it")?
I'm wondering if that might be a consequence of the word embedding vectors fed into the network, which somehow already encapsulate some notion of distance between the words.
Upvotes: 1
Views: 981
Reputation: 1
Word embeddings are first added to a positional encoding, which injects information about each word's position in the sequence. Then, as they pass through the encoder stack (6 layers in the original paper), the embeddings undergo multiple transformations and are refined into better representations before being passed on to the decoder.
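For concreteness, here is a minimal NumPy sketch of that first step, using the sinusoidal positional encoding from the paper. The sentence length and model dimension are illustrative (the original uses d_model = 512):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model)[None, :]                   # (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])              # even dimensions: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])              # odd dimensions: cosine
    return pe

# Toy embeddings for the 11-token example sentence; sizes chosen for readability.
seq_len, d_model = 11, 8
embeddings = np.random.randn(seq_len, d_model)        # stand-in for learned embeddings
encoder_input = embeddings + positional_encoding(seq_len, d_model)
```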
The transformations applied to the embeddings as they pass through the encoder stack are learnable. Some attention heads in the upper layers may appear to be doing something that looks like coreference resolution, which is what you pointed out in your example. Attending more to the word "animal" simply produces a better representation of "it" than attending to "street" would.
How do we know which representations are better? The ones that minimize the loss and produce better output, of course!
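To make the mechanism concrete, here is a minimal sketch of a single scaled dot-product attention head over the example sentence. The projection matrices Wq, Wk, Wv are random stand-ins for the learned parameters, so the printed weights are arbitrary; only after training do these matrices make "it" place high weight on "animal":

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V, weights

tokens = ["The", "animal", "didn't", "cross", "the", "street",
          "because", "it", "was", "too", "tired"]
d_model = 8
x = np.random.randn(len(tokens), d_model)             # stand-in encoder states

# Wq, Wk, Wv are learned in a real model; random here just to run the math.
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))
out, weights = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)

it = tokens.index("it")
print(dict(zip(tokens, weights[it].round(3))))        # how much "it" attends to each word
```

The key point is that nothing in this computation hard-codes coreference: the attention distribution is entirely determined by the learned projections, which gradient descent shapes so that the resulting representations lower the loss.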
Upvotes: 0