Reputation: 2549
I am trying to understand why transformers use multiple attention heads. I found the following quote:
Instead of using a single attention function where the attention can be dominated by the actual word itself, transformers use multiple attention heads.
What is meant by "the attention being dominated by the word itself" and how does the use of multiple heads address that?
Upvotes: 13
Views: 6582
Reputation: 24331
Transformers were originally proposed, as the title of "Attention is All You Need" implies, as a more efficient seq2seq model, doing away with the recurrent (RNN) structure commonly used until that point.
However, in pursuing this efficiency, a single-headed attention mechanism has reduced descriptive power compared to RNN-based models. Multiple heads were proposed to mitigate this, allowing the model to learn several lower-dimensional feature maps rather than one all-encompassing map:
In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions [...] This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention...
- Attention is All You Need (2017)
As such, multiple attention heads in a single transformer layer are analogous to multiple kernels in a single CNN layer: they have the same architecture and operate on the same feature space, but since they are separate 'copies' with different sets of weights, they are 'free' to learn different functions.
In a CNN this may correspond to different definitions of visual features, and in a Transformer this may correspond to different definitions of relevance.¹
For example:
| Architecture | Input | (Layer 1) Kernel/Head 1 | (Layer 1) Kernel/Head 2 |
|---|---|---|---|
| CNN | Image | Diagonal edge-detection | Horizontal edge-detection |
| Transformer | Sentence | Attends to next word | Attends from verbs to their direct objects |
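To make the 'separate copies with different weights' point concrete, here is a minimal PyTorch sketch of multi-head self-attention. The variable names and the random weights are purely illustrative (this is not the paper's code); the shapes follow the paper's defaults of d_model = 512 and h = 8 heads.

```python
import torch
import torch.nn.functional as F

d_model, h = 512, 8
d_k = d_model // h                      # each head works in a lower-dimensional subspace

x = torch.randn(10, d_model)            # a "sentence" of 10 token embeddings

# Each head gets its own projection weights -- the separate 'copies' mentioned above.
W_q = torch.randn(h, d_model, d_k)
W_k = torch.randn(h, d_model, d_k)
W_v = torch.randn(h, d_model, d_k)
W_o = torch.randn(h * d_k, d_model)     # final output projection

heads = []
for i in range(h):
    Q, K, V = x @ W_q[i], x @ W_k[i], x @ W_v[i]   # (10, d_k) each
    scores = Q @ K.T / d_k ** 0.5                  # scaled dot-product attention
    weights = F.softmax(scores, dim=-1)            # (10, 10): one distribution per query token
    heads.append(weights @ V)                      # head i's view of the sentence

out = torch.cat(heads, dim=-1) @ W_o               # (10, d_model), same shape as the input
```

Because each head projects the input with its own W_q/W_k/W_v, one head's softmax can concentrate on, say, the next word while another concentrates on a verb's direct object, just as two kernels in the same CNN layer can detect different edge orientations.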
Notes:
¹ While no single head performs well at many relations, we find that particular heads correspond remarkably well to particular relations. For example, we find heads that find direct objects of verbs, determiners of nouns, objects of prepositions, and objects of possessive pronouns...
Upvotes: 15
Reputation: 178
Multi-headed attention was introduced based on the observation that different words relate to each other in different ways. For a given word, the other words in the sentence can moderate or negate its meaning, but they can also express relations such as inheritance (is a kind of), possession (belongs to), etc.
I found this online lecture very helpful; it uses the following example:
"The restaurant was not too terrible."
Note that the meaning of the word 'terrible' is distorted by the two words 'too' and 'not' ('too': moderation, 'not': inversion), and that 'terrible' also relates to 'restaurant', as it expresses a property of it.
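If you want to see this for yourself, below is a rough sketch (my own, not from the lecture) that prints where each attention head in one layer of a pretrained BERT model looks from the token 'terrible'. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint; the layer index is an arbitrary choice.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The restaurant was not too terrible.", return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, shaped (batch, heads, seq_len, seq_len)
layer = outputs.attentions[5][0]          # pick one layer; shape (heads, seq_len, seq_len)
terrible = tokens.index("terrible")       # assumes 'terrible' is a single wordpiece

for head in range(layer.shape[0]):
    weights = layer[head, terrible]       # where 'terrible' looks, in this head
    top = weights.topk(2).indices.tolist()
    print(f"head {head}: attends mostly to {[tokens[i] for i in top]}")
```

You may find some heads putting most of their weight on the modifiers 'not' and 'too' while others attend to 'restaurant' or to neighbouring positions, which is exactly the kind of division of labour that multiple heads make possible.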
Upvotes: 6