Reputation: 144
I am a bit confused about the definition of Multihead.
Are [1] and [2] below the same?
[1]
My understanding of multi-head is that it means multiple attention patterns, as below.
"multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder)."
http://jalammar.github.io/illustrated-transformer/
But
[2] In class MultiheadAttention(Module): in the PyTorch Transformer module, it seems that embed_dim is DIVIDED by the number of heads. Why?
Or is embed_dim meant to be the feature dimension times the number of heads in the first place?
self.head_dim = embed_dim // num_heads
assert self.head_dim * num_heads == self.embed_dim, "embed_dim must be divisible by num_heads"
https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/activation.py
Upvotes: 2
Views: 1786
Reputation: 7369
As per your understanding, multi-head attention is attention applied multiple times over the same data.
However, it is not implemented by multiplying the set of weight matrices by the number of required attention heads. Instead, you rearrange (reshape) the single weight matrix according to the number of heads, so that each head operates on its own slice of it. In essence, it is still attention applied multiple times, but each head attends to a different part of the weights, and therefore to a different part of the embedding dimension.
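To make the reshaping concrete, here is a minimal sketch (not the actual PyTorch source; the tensor names and sizes below are made up for illustration) showing how a single embed_dim-sized projection is viewed as num_heads chunks of head_dim each:

import torch

# Illustration only: how one full-size projection is split across heads.
embed_dim, num_heads = 512, 8
head_dim = embed_dim // num_heads          # 64, matching the PyTorch assert
assert head_dim * num_heads == embed_dim

batch, seq_len = 2, 10
x = torch.randn(batch, seq_len, embed_dim)

# A single weight matrix of shape (embed_dim, embed_dim) produces all queries at once...
w_q = torch.randn(embed_dim, embed_dim)
q = x @ w_q.T                              # (batch, seq_len, embed_dim)

# ...and is then viewed as num_heads separate projections of size head_dim each.
q_heads = q.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)
print(q_heads.shape)                       # torch.Size([2, 8, 10, 64])

So embed_dim stays the model's feature dimension; each head simply gets an embed_dim // num_heads slice of it rather than its own full-size set of weights.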
Upvotes: 1