Vivek Subramanian

Reputation: 1234

HuggingFace BERT `inputs_embeds` giving unexpected result

The HuggingFace BERT TensorFlow implementation lets us feed in precomputed embeddings in place of BERT's native embedding lookup. This is done through the optional inputs_embeds parameter of the model's call method (used instead of input_ids). To test this, I wanted to confirm that feeding in BERT's own embedding lookup gives the same result as feeding in the input_ids themselves.

The result of BERT's embedding lookup can be obtained by setting the BERT configuration parameter output_hidden_states to True and taking the first tensor from the last output of the call method. (The remaining 12 tensors in that output correspond to the 12 hidden layers.)

Thus, I wrote the following code to test my hypothesis:

import tensorflow as tf
from transformers import BertConfig, BertTokenizer, TFBertModel

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

input_ids = tf.constant(bert_tokenizer.encode("Hello, my dog is cute", add_special_tokens=True))[None, :]
attention_mask = tf.stack([tf.ones(shape=(len(sent),)) for sent in input_ids])
token_type_ids = tf.stack([tf.ones(shape=(len(sent),)) for sent in input_ids])

config = BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True)
bert_model = TFBertModel.from_pretrained('bert-base-uncased', config=config)

result = bert_model(inputs={'input_ids': input_ids,
                            'attention_mask': attention_mask,
                            'token_type_ids': token_type_ids})
inputs_embeds = result[-1][0]
result2 = bert_model(inputs={'inputs_embeds': inputs_embeds,
                             'attention_mask': attention_mask,
                             'token_type_ids': token_type_ids})

print(tf.reduce_sum(tf.abs(result[0] - result2[0])))  # 458.2522, should be 0

Again, the output of the call method is a tuple. The first element of this tuple is the output of the last layer of BERT. Thus, I expected result[0] and result2[0] to match. Why is this not the case?

I am using Python 3.6.10 with tensorflow version 2.1.0 and transformers version 2.5.1.

EDIT: Looking at the HuggingFace code, it seems that the raw embeddings (looked up when input_ids is given, or taken directly when inputs_embeds is given) are added to the positional embeddings and token type embeddings before being fed into subsequent layers. If so, what I'm getting from result[-1][0] may be the raw embedding plus the positional and token type embeddings, which would mean they are erroneously added in a second time when I feed result[-1][0] as inputs_embeds to calculate result2.

Could someone please tell me if this is the case and if so, please explain how to get the positional and token type embeddings, so I can subtract them out? Below is what I came up with for positional embeddings based on the equations given here (but according to the BERT paper, the positional embeddings may actually be learned, so I'm not sure if these are valid):

import numpy as np

positional_embeddings = np.stack([np.zeros(shape=(len(sent),768)) for sent in input_ids])
for s in range(len(positional_embeddings)):
    for i in range(len(positional_embeddings[s])):
        for j in range(len(positional_embeddings[s][i])):
            if j % 2 == 0:
                positional_embeddings[s][i][j] = np.sin(i/np.power(10000., j/768.))
            else:
                positional_embeddings[s][i][j] = np.cos(i/np.power(10000., (j-1.)/768.))
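# Cast to float32 to match inputs_embeds (NumPy arrays default to float64)
positional_embeddings = tf.constant(positional_embeddings, dtype=tf.float32)
# Subtract, since the goal stated above is to remove the positional component
inputs_embeds -= positional_embeddings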
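
For reference, the triple loop above can be written without Python-level loops. The following is only a sketch of the same sinusoidal formula in NumPy; the function name sinusoidal_positions and its hidden parameter are made up for illustration. It computes the fixed sinusoidal table only and says nothing about whether BERT actually uses it (the BERT paper describes learned positional embeddings).

```python
import numpy as np


def sinusoidal_positions(seq_len, hidden=768):
    """Return a (seq_len, hidden) table of fixed sinusoidal position encodings.

    Hypothetical helper: even columns j hold sin(pos / 10000**(j / hidden)),
    odd columns j hold cos(pos / 10000**((j - 1) / hidden)), as in the loop above.
    """
    positions = np.arange(seq_len)[:, None]   # shape (seq_len, 1)
    dims = np.arange(hidden)[None, :]         # shape (1, hidden)
    even_dims = dims - (dims % 2)             # j for even j, j - 1 for odd j
    angles = positions / np.power(10000.0, even_dims / hidden)
    table = np.where(dims % 2 == 0, np.sin(angles), np.cos(angles))
    return table.astype(np.float32)           # float32 to match typical model tensors


table = sinusoidal_positions(seq_len=8)
print(table.shape)
```

Adding a leading axis (table[None, :, :]) gives a batch dimension that broadcasts against a (batch, seq_len, 768) tensor.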

Upvotes: 8

Views: 3563

Answers (1)

Vivek Subramanian

Reputation: 1234

My suspicion was correct: the positional and token type embeddings were being added in a second time. After reading the library code more closely, I replaced the line:

inputs_embeds = result[-1][0]

with the lines:

embeddings = bert_model.bert.get_input_embeddings().word_embeddings
inputs_embeds = tf.gather(embeddings, input_ids)

Now, the difference is 0.0, as expected.
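
To illustrate why this works, here is a toy NumPy sketch of an embedding layer that sums word, position, and token type embeddings and then normalizes. It is not the HuggingFace implementation: the table sizes, the embed function, and the simplified layer_norm (no learned scale or bias, no dropout) are all assumptions made for the example. As far as I can tell from the library source, the real embedding output also passes through a LayerNorm, which would be one more reason that subtracting the positional and token type embeddings from result[-1][0] cannot recover the raw word embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, n_types, hidden = 50, 16, 2, 8  # toy sizes, not BERT's

# Random tables standing in for the model's learned embedding matrices.
word_table = rng.normal(size=(vocab_size, hidden))
position_table = rng.normal(size=(max_len, hidden))
token_type_table = rng.normal(size=(n_types, hidden))


def layer_norm(x, eps=1e-12):
    """Simplified LayerNorm over the last axis (no learned scale or bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def embed(token_type_ids, input_ids=None, inputs_embeds=None):
    """Toy embedding layer: word + position + token type, then LayerNorm."""
    if inputs_embeds is None:
        inputs_embeds = word_table[input_ids]        # raw word-embedding lookup
    seq_len = inputs_embeds.shape[1]
    positions = position_table[np.arange(seq_len)]   # broadcasts over the batch axis
    types = token_type_table[token_type_ids]
    return layer_norm(inputs_embeds + positions + types)


input_ids = np.array([[3, 17, 42, 8, 5]])
token_type_ids = np.zeros_like(input_ids)

from_ids = embed(token_type_ids, input_ids=input_ids)

# Feeding the layer's own output back in adds positions and types a second time.
fed_back = embed(token_type_ids, inputs_embeds=from_ids)

# Feeding the raw word-embedding rows reproduces the input_ids path.
from_raw = embed(token_type_ids, inputs_embeds=word_table[input_ids])

diff_fed_back = np.abs(from_ids - fed_back).sum()
diff_raw = np.abs(from_ids - from_raw).sum()
print(diff_fed_back, diff_raw)
```

In this toy setup, only the raw lookup rows give a zero difference, which mirrors what tf.gather(embeddings, input_ids) does in the fix above.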

Upvotes: 3
