TDo

Reputation: 744

BERT - Pooled output is different from first vector of sequence output

I am using BERT in TensorFlow and there is one detail I don't quite understand. According to the documentation (https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1), the pooled output is the representation of the entire sequence. Based on the original paper, it seems like this is the output for the token "[CLS]" at the beginning of the sentence.

pooled_output[0]

However, when I look at the output corresponding to the first token in the sentence

sequence_output[0,0,:]

which I believe corresponds to the token "[CLS]" (the first token in the sentence), the two results are different.
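
Roughly, this is how I get the two tensors (a minimal TF 1.x sketch; I am assuming the module's "tokens" signature with input_ids/input_mask/segment_ids here):

# Minimal TF 1.x sketch: get pooled_output and sequence_output from the TF Hub BERT module.
import tensorflow as tf          # TF 1.x
import tensorflow_hub as hub

bert = hub.Module("https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1")

# Toy input "[CLS] hello [SEP]" -> ids 101, 7592, 102 in the uncased vocab.
input_ids   = tf.constant([[101, 7592, 102]])
input_mask  = tf.constant([[1, 1, 1]])
segment_ids = tf.constant([[0, 0, 0]])

outputs = bert(
    dict(input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids),
    signature="tokens", as_dict=True)

pooled_output   = outputs["pooled_output"]    # shape [batch_size, 768]
sequence_output = outputs["sequence_output"]  # shape [batch_size, seq_len, 768]

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    pooled, seq = sess.run([pooled_output, sequence_output])
    print((pooled[0] == seq[0, 0, :]).all())  # False -- the two vectors differ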

Upvotes: 20

Views: 17917

Answers (4)

Pranav Kushare

Reputation: 41

pooled_output[0] != sequence_output[0,0,:]

sequence_output: simply the last-layer hidden representation of each token, of size (batch_size, seq_len, hidden_size).

pooler_output: the representation/embedding of the CLS token passed through a few more layers (the BertPooler: a linear/dense layer and an activation function). It is recommended to use this pooler_output, as it contains contextualized information of the whole sequence.
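
A quick way to see the two shapes and the mismatch (a sketch with the HuggingFace transformers BertModel; the question uses the TF Hub module, but the output layout is the same):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("hello world", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

print(out.last_hidden_state.shape)  # torch.Size([1, seq_len, 768]) -- per-token vectors
print(out.pooler_output.shape)      # torch.Size([1, 768])          -- pooled vector
# The raw CLS vector and pooler_output are not equal:
print(torch.allclose(out.last_hidden_state[:, 0], out.pooler_output))  # False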

Upvotes: 3

Masoud Gheisari

Reputation: 1497

As mentioned in the Huggingface documentation for the output of BertModel, the pooler output is:

Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function.

So, because of the "further processed by a Linear layer and a Tanh activation function" part, the first vector of the sequence output (the CLS token) and the pooled output do not have the same values (but both vectors have the same size).

Upvotes: 14

Ashwin Geet D'Sa

Reputation: 7369

The intentions of pooled_output and sequence_output are different. Since the embeddings from the BERT model at the output layer are known to be contextual embeddings, the output of the 1st token, i.e. the [CLS] token, would have captured sufficient context. Hence, the authors of the BERT paper found it sufficient to use only the output from the 1st token for a few tasks such as classification. They call this output derived from the single (1st) token the pooled_output.

The source code of the TF Hub module is not available, but it presumably uses the same implementation as the open-sourced code by the authors of BERT (https://github.com/google-research/bert/). As given in the modeling.py script (https://github.com/google-research/bert/blob/bee6030e31e42a9394ac567da170a89a98d2062f/modeling.py), the pooled_output (obtained via the get_pooled_output() function) is the hidden state of the 1st token passed through an additional dense layer with a tanh activation.
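
A small standalone sketch (TF 1.x style, paraphrasing the "pooler" section of that modeling.py rather than quoting it verbatim) of what that pooling step does:

import tensorflow as tf  # TF 1.x

def pool_first_token(sequence_output, hidden_size=768):
    # sequence_output: [batch_size, seq_length, hidden_size]
    # Keep only the hidden state of the first ([CLS]) token ...
    first_token_tensor = tf.squeeze(sequence_output[:, 0:1, :], axis=1)
    # ... and pass it through a dense layer with a tanh activation.
    return tf.layers.dense(first_token_tensor, hidden_size, activation=tf.tanh)

pooled = pool_first_token(tf.zeros([2, 5, 768]))  # shape [2, 768]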

Upvotes: 13

Ley

Reputation: 71

I encountered a similar problem when I was using BertModel from the transformers library, and I figure your question may be the same. Here's what I found:

The outputs of BertModel contain a sequence_output (normally of shape [batch_size, max_sequence_length, 768]), which is the last layer of BERT. They also contain a pooled_output (normally of shape [batch_size, 768]), which is the output of an additional "pooler" layer. The pooler layer takes sequence_output[:, 0] (the first token, i.e. the CLS token) and passes it through a dense layer and a Tanh activation.

That's where pooled_output got its name and why it's different from the CLS token output, but both should serve the same purpose.
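
You can verify that relationship directly (a sketch with the transformers BertModel; model.pooler.dense is the dense layer of the pooler described above, and its activation is Tanh):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("hello world", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)
    # Re-apply the pooler by hand: dense layer + Tanh on the CLS hidden state.
    recomputed = torch.tanh(model.pooler.dense(out.last_hidden_state[:, 0]))

print(torch.allclose(recomputed, out.pooler_output, atol=1e-6))  # True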

Upvotes: 4
