Reputation: 2549
I am trying to understand why transformers use multiple attention heads. I found the following quote:
Instead of using a single attention function where the attention can be dominated by the actual word itself, transformers use multiple attention heads.
What is meant by "the attention being dominated by the word itself" and how does the use of multiple heads address that?
Upvotes: 13
Views: 6582
Reputation: 24331
Transformers were originally proposed, as the title of "Attention is All You Need" implies, as a more efficient seq2seq model, doing away with the recurrent (RNN) structure commonly used until that point.
However, in pursuing this efficiency, a single-headed attention mechanism has reduced descriptive power compared to RNN-based models. Multiple heads were proposed to mitigate this, allowing the model to learn several lower-dimensional feature maps rather than one all-encompassing map:
In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions [...] This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention...
- Attention is All You Need (2017)
As such, multiple attention heads in a single transformer layer are analogous to multiple kernels in a single CNN layer: they have the same architecture and operate on the same feature space, but since they are separate 'copies' with different sets of weights, they are 'free' to learn different functions.
In a CNN this may correspond to different definitions of visual features, and in a Transformer this may correspond to different definitions of relevance.¹
For example:
| Architecture | Input | (Layer 1) Kernel/Head 1 | (Layer 1) Kernel/Head 2 |
|---|---|---|---|
| CNN | Image | Diagonal edge-detection | Horizontal edge-detection |
| Transformer | Sentence | Attends to next word | Attends from verbs to their direct objects |
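To make the 'separate copies with different weights' point concrete, here is a minimal PyTorch sketch of multi-head self-attention. The variable names and the random weights are purely illustrative (this is not the paper's code); the shapes follow the paper's defaults of d_model = 512 and h = 8 heads.

```python
import torch
import torch.nn.functional as F

d_model, h = 512, 8
d_k = d_model // h                      # each head works in a lower-dimensional subspace

x = torch.randn(10, d_model)            # a "sentence" of 10 token embeddings

# Each head gets its own projection weights -- the separate 'copies' mentioned above.
W_q = torch.randn(h, d_model, d_k)
W_k = torch.randn(h, d_model, d_k)
W_v = torch.randn(h, d_model, d_k)
W_o = torch.randn(h * d_k, d_model)     # final output projection

heads = []
for i in range(h):
    Q, K, V = x @ W_q[i], x @ W_k[i], x @ W_v[i]   # (10, d_k) each
    scores = Q @ K.T / d_k ** 0.5                  # scaled dot-product attention
    weights = F.softmax(scores, dim=-1)            # (10, 10): one distribution per query token
    heads.append(weights @ V)                      # head i's view of the sentence

out = torch.cat(heads, dim=-1) @ W_o               # (10, d_model), same shape as the input
```

Because each head projects the input with its own W_q/W_k/W_v, one head's softmax can concentrate on, say, the next word while another concentrates on a verb's direct object, just as two kernels in the same CNN layer can detect different edge orientations.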
Notes:
¹ While no single head performs well at many relations, we find that particular heads correspond remarkably well to particular relations. For example, we find heads that find direct objects of verbs, determiners of nouns, objects of prepositions, and objects of possessive pronouns...
Upvotes: 15
Reputation: 178
Multi-headed attention was introduced based on the observation that different words relate to each other in different ways. For a given word, the other words in the sentence can moderate or negate its meaning, but they can also express relations such as inheritance (is a kind of), possession (belongs to), etc.
I found this online lecture very helpful; it uses the following example:
"The restaurant was not too terrible."
Note that the meaning of the word 'terrible' is distorted by the two words 'too' and 'not' ('too': moderation, 'not': inversion), and that 'terrible' also relates to 'restaurant', as it expresses a property of it.
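If you want to see this for yourself, below is a rough sketch (my own, not from the lecture) that prints where each attention head in one layer of a pretrained BERT model looks from the token 'terrible'. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint; the layer index is an arbitrary choice.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The restaurant was not too terrible.", return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, shaped (batch, heads, seq_len, seq_len)
layer = outputs.attentions[5][0]          # pick one layer; shape (heads, seq_len, seq_len)
terrible = tokens.index("terrible")       # assumes 'terrible' is a single wordpiece

for head in range(layer.shape[0]):
    weights = layer[head, terrible]       # where 'terrible' looks, in this head
    top = weights.topk(2).indices.tolist()
    print(f"head {head}: attends mostly to {[tokens[i] for i in top]}")
```

You may find some heads putting most of their weight on the modifiers 'not' and 'too' while others attend to 'restaurant' or to neighbouring positions, which is exactly the kind of division of labour that multiple heads make possible.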
Upvotes: 6