I was trying to add an additional layer after the Hugging Face BERT transformer, so I used BertForSequenceClassification inside my nn.Module network. However, the wrapped model gives me seemingly random outputs compared to loading the model directly.
Model 1:
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels = 5) # as we have 5 classes
import torch
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode(texts[0], add_special_tokens=True, max_length = 512)).unsqueeze(0) # Batch size 1
print(model(input_ids))
Out:
(tensor([[ 0.3610, -0.0193, -0.1881, -0.1375, -0.3208]],
grad_fn=<AddmmBackward>),)
Model 2:
import torch
from torch import nn
class BertClassifier(nn.Module):
    def __init__(self):
        super(BertClassifier, self).__init__()
        self.bert = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels = 5)
        # as we have 5 classes
        # we want our output as probability so, in the evaluation mode, we'll pass the logits to a softmax layer
        self.softmax = torch.nn.Softmax(dim = 1) # last dimension
    def forward(self, x):
        print(x.shape)
        x = self.bert(x)
        if self.training == False: # in evaluation mode
            pass
            #x = self.softmax(x)
        return x
# create our model
bertclassifier = BertClassifier()
print(bertclassifier(input_ids))
Out:
torch.Size([1, 512])
torch.Size([1, 5])
(tensor([[-0.3729, -0.2192, 0.1183, 0.0778, -0.2820]],
grad_fn=<AddmmBackward>),)
They should be the same model, right? I found a similar issue here, but with no reasonable explanation: https://github.com/huggingface/transformers/issues/2770
Does BERT have some randomized parameter? If so, how do I get reproducible output?
Why do the two models give me different outputs? Is there something I'm doing wrong?
Upvotes: 3
Views: 2338
Reputation: 11198
The difference comes from the random initialization of BERT's classifier layer. If you print your model, you'll see:
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=5, bias=True)
)
There is a classifier in the last layer; this layer is added after bert-base and is not part of the pretrained checkpoint. The expectation is that you'll train this layer for your downstream task.
If you want to get more insight:
model, li = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels = 5, output_loading_info=True) # as we have 5 classes
print(li)
{'missing_keys': ['classifier.weight', 'classifier.bias'], 'unexpected_keys': ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias'], 'error_msgs': []}
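You can see that classifier.weight and classifier.bias are missing, so these parameters will be randomly initialized each time you call BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels = 5). That is why two separately loaded models give different outputs for the same input.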
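To illustrate the effect without downloading anything, here is a minimal sketch in which a plain nn.Linear(768, 5) stands in for the randomly initialized classifier head. The helper make_head and the dummy features tensor are hypothetical names used only for this illustration. It shows two ways that should make the outputs match: seeding the random number generator before each construction, or copying the weights from one instance to the other.

```python
import torch
from torch import nn

def make_head(seed=None):
    # Hypothetical stand-in for the classifier head that is missing from the checkpoint.
    if seed is not None:
        torch.manual_seed(seed)  # fix the RNG state so the random init is repeatable
    return nn.Linear(768, 5)

features = torch.ones(1, 768)  # dummy stand-in for BERT's pooled output

# Two unseeded heads get different random weights, hence different logits.
head_a = make_head()
head_b = make_head()
different = not torch.equal(head_a.weight, head_b.weight)

# Option 1: seed before each construction so both heads start identical.
seeded_a = make_head(seed=0)
seeded_b = make_head(seed=0)
same_when_seeded = torch.equal(seeded_a(features), seeded_b(features))

# Option 2: copy the weights from one instance into the other.
head_b.load_state_dict(head_a.state_dict())
same_after_copy = torch.equal(head_a(features), head_b(features))

print(different, same_when_seeded, same_after_copy)
```

With the real models from the question, the analogous steps would presumably be calling torch.manual_seed(...) immediately before each from_pretrained call, or running bertclassifier.bert.load_state_dict(model.state_dict()) so that both share the same classifier weights. Either way, the classifier remains untrained until you fine-tune it.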
Upvotes: 6