Fijoy Vadakkumpadan

Reputation: 668

Normalization of token embeddings in BERT encoder blocks

Following the multi-headed attention layer in a BERT encoder block, is layer normalization done separately on the embedding of each token (i.e., one mean and variance per token embedding), or on the concatenated vector of all token embeddings (the same mean and variance for all embeddings)?

Upvotes: 2

Views: 132

Answers (2)

Fijoy Vadakkumpadan

Reputation: 668

I tracked down the full details of layer normalization (LN) in BERT here.

Mean and variance are computed per token, but the weight and bias parameters learned in LN are not per token; they are per embedding dimension.
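
Here is a minimal PyTorch sketch of that behavior (the batch size, sequence length, and the BERT-base hidden size of 768 are just illustrative): `nn.LayerNorm(hidden_size)` computes one mean and variance per token, over the last dimension, while its learned weight and bias are a single vector each, with one entry per embedding dimension, shared across all tokens.

```python
import torch
import torch.nn as nn

hidden_size = 768          # BERT-base embedding dimension (illustrative)
batch, seq_len = 2, 5      # arbitrary example shapes

# Simulated output of the multi-head attention sublayer (plus residual)
x = torch.randn(batch, seq_len, hidden_size)

# LayerNorm over the last dimension only: statistics are per token
ln = nn.LayerNorm(hidden_size)
y = ln(x)

# One mean/variance per token embedding...
print(y.mean(dim=-1).shape)           # torch.Size([2, 5]) -> one value per token
# ...but a single learned weight/bias vector shared across tokens,
# with one entry per embedding dimension
print(ln.weight.shape, ln.bias.shape)  # torch.Size([768]) torch.Size([768])
```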

Upvotes: 0

Parman M. Alizadeh

Reputation: 1553

Layer normalization is applied to each token's embedding individually: each token is normalized with its own mean and variance, computed over its embedding dimensions. This keeps every token's representation on a consistent scale, regardless of the other tokens in the sequence.

BERT differs from the original Transformer architecture in the placement of layer normalization. In BERT, it's applied before the self-attention mechanism, while in the original Transformer, it's applied after. This subtle difference can have a significant impact on the model's performance (see here).

Update: Please refer to the following paper: On Layer Normalization in the Transformer Architecture. The authors explored both approaches of applying layer normalization before and after the attention layer (namely, Pre-LN and Post-LN) in BERT. Their results indicate that applying layer normalization before the attention layer yields better results. For a summarized review of the same paper, see here. Overall, you may come across different BERT diagrams, each using a different placement of layer normalization.
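
To make the difference concrete, here is a hedged sketch of the two placements inside one encoder block. This is not the reference implementation from either paper; `attn` and `ffn` are toy stand-ins (plain linear layers) just to keep the example runnable, and the hidden size is the usual 768 of BERT-base.

```python
import torch
import torch.nn as nn

hidden = 768

def post_ln_block(x, attn, ffn, ln1, ln2):
    # Post-LN (original Transformer): LayerNorm after the residual addition
    x = ln1(x + attn(x))
    x = ln2(x + ffn(x))
    return x

def pre_ln_block(x, attn, ffn, ln1, ln2):
    # Pre-LN: LayerNorm on each sublayer's input; the residual path stays unnormalized
    x = x + attn(ln1(x))
    x = x + ffn(ln2(x))
    return x

# Toy stand-ins for the attention and feed-forward sublayers
attn = nn.Linear(hidden, hidden)
ffn = nn.Linear(hidden, hidden)
ln1, ln2 = nn.LayerNorm(hidden), nn.LayerNorm(hidden)

x = torch.randn(2, 5, hidden)
print(post_ln_block(x, attn, ffn, ln1, ln2).shape)  # torch.Size([2, 5, 768])
print(pre_ln_block(x, attn, ffn, ln1, ln2).shape)   # torch.Size([2, 5, 768])
```

Either way, the normalization itself is still computed per token over the embedding dimension; only its position relative to the residual connection changes.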

Upvotes: 2
