Reputation: 25
I'm trying to use the Hugging Face transformers API. While working with the library, a few questions came up. If anyone knows the answers, please share your knowledge.
The transformers library has several pretrained models. It provides not only bare models like BertModel, RobertaModel, ..., but also convenient heads like ModelForMultipleChoice, ModelForSequenceClassification, ModelForTokenClassification, and ModelForQuestionAnswering.
I wonder what the difference is between taking the bare model and adding a new linear transformation myself versus using ModelForSequenceClassification. In other words, what is different between a custom model (pretrained model with a randomly initialized linear layer) and the transformers ModelForSequenceClassification?
Is ModelForSequenceClassification trained on GLUE data? I look forward to someone's reply. Thanks.
Upvotes: 1
Views: 1760
Reputation: 11488
I think it's easiest to understand if we have a look at the actual implementation, where I randomly chose RobertaModel and RobertaForSequenceClassification as an example. However, the conclusion is valid for all other models, too.
You can find the implementation for RobertaForSequenceClassification here, which looks roughly like this:
class RobertaForSequenceClassification(RobertaPreTrainedModel):
    authorized_missing_keys = [r"position_ids"]

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels

        self.roberta = RobertaModel(config, add_pooling_layer=False)
        self.classifier = RobertaClassificationHead(config)

        self.init_weights()

    [...]

    def forward([...]):
        [...]
As we can see, there is no indication of any pretraining here, and it simply adds another linear layer on top (the implementation of the RobertaClassificationHead can be found a bit further down, namely here):
class RobertaClassificationHead(nn.Module):
    """Head for sentence-level classification tasks."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features, **kwargs):
        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x
So, to answer your question: These models come without any pretrained additional layers on top, and you could easily implement them yourself*.
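For illustration, here is a minimal sketch of what such a custom wrapper could look like (the class and attribute names are my own, not part of the library; it assumes a standard transformers and torch installation):

import torch.nn as nn
from transformers import RobertaModel

class MyRobertaClassifier(nn.Module):
    """Pretrained RoBERTa encoder plus a randomly initialised linear head."""

    def __init__(self, num_labels, model_name="roberta-base"):
        super().__init__()
        self.roberta = RobertaModel.from_pretrained(model_name)      # pretrained weights
        config = self.roberta.config
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, num_labels)  # untrained head

    def forward(self, input_ids, attention_mask=None):
        outputs = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
        cls = outputs.last_hidden_state[:, 0, :]   # <s> token, as in the head above
        return self.classifier(self.dropout(cls))

The only trained part here is the RobertaModel encoder; the linear head starts from random weights, exactly as in RobertaForSequenceClassification.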
Now for the asterisk: While it would be easy to wrap this yourself, also note that the class inherits from RobertaPreTrainedModel. This has several advantages, the most important one being a consistent design across the different implementations (sequence classification model, sequence tagging model, etc.). Further, they provide some neat functionalities, like the forward call accepting extensive parameters (padding, masking, attention output, ...), which would cost quite some time to implement yourself.
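As a rough example of those conveniences in action (standard transformers usage; the checkpoint name and label count below are just placeholders):

import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# the tokenizer handles padding/truncation, forward() handles masking and attention output
batch = tokenizer(["a short sentence", "a somewhat longer sentence"],
                  padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch, output_attentions=True)

print(outputs.logits.shape)      # (batch_size, num_labels)
print(len(outputs.attentions))   # one attention tensor per layer

Note that loading the plain roberta-base checkpoint this way will warn you that the classifier weights are newly initialized, which again illustrates that the head on top is not pretrained.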
Last but not least, there are existing trained models based on these specific implementations, which you can search for on the Hugging Face Model Hub. There, you might find models that are fine-tuned on a sequence classification task (e.g., this one), whose weights you can then load directly into a RobertaForSequenceClassification model. If you had your own implementation of a sequence classification model, loading and aligning these pre-trained weights would be considerably more complicated.
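For instance, loading such a fine-tuned checkpoint is a one-liner; the model id below is only an illustrative example, not necessarily the model linked above:

from transformers import AutoTokenizer, RobertaForSequenceClassification

model_id = "cardiffnlp/twitter-roberta-base-sentiment"   # illustrative fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = RobertaForSequenceClassification.from_pretrained(model_id)

inputs = tokenizer("I really enjoyed this movie!", return_tensors="pt")
logits = model(**inputs).logits
print(logits.softmax(dim=-1))    # probabilities over the fine-tuned label set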
I hope this answers your main concern, but feel free to elaborate (either as a comment or a new question) on any points that have not been addressed!
Upvotes: 4