Reputation: 43
I have read lots of articles saying that BERT is good for NLU while GPT is good for NLG. But the key structural difference between them seems to be just whether a mask is added in self-attention, plus the fact that the models are trained in different ways.
From the code below, if I understand correctly, we are free to choose whether or not to add an attention mask: https://github.com/huggingface/transformers/blob/master/src/transformers/models/bert/modeling_bert.py https://github.com/huggingface/transformers/blob/master/src/transformers/models/gpt2/modeling_gpt2.py
So can I conclude that it should really be "the pretrained parameters of BERT are good for NLU" and "the pretrained parameters of GPT-2 are good for NLG"? Or is there some other critical difference between the two that makes people draw the conclusion I mentioned at the beginning?
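To make sure I understand the masking difference I mean, here is a minimal sketch (not the actual Hugging Face code; `scores` is just an illustrative tensor for one attention head):

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)  # raw attention scores for one head

# BERT-style (bidirectional): every token may attend to every other token.
bert_attn = torch.softmax(scores, dim=-1)

# GPT-style (causal): a lower-triangular mask blocks attention to future tokens.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
gpt_attn = torch.softmax(scores.masked_fill(~causal_mask, float("-inf")), dim=-1)

print(bert_attn[0])  # position 0 attends to all 5 positions
print(gpt_attn[0])   # position 0 attends only to itself
```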
Upvotes: 1
Views: 1760
Reputation: 7379
BERT and GPT are trained with different objectives and for different purposes.
BERT is trained as an auto-encoder. Masked Language Modelling (MLM) corrupts the input by masking some tokens, and the model's objective is to recover those masked tokens. It uses bidirectional self-attention, where each token in the input sentence attends to both its left and right context.
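For example, a quick illustration of the MLM objective using the standard transformers fill-mask pipeline (the example sentence is just for illustration):

```python
from transformers import pipeline

# BERT predicts the [MASK] token using context on BOTH sides of it.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The capital of France is [MASK]."):
    print(pred["token_str"], pred["score"])
```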
In contrast, GPT is trained as an auto-regressive model. It is trained with a language-modelling objective, where the given sequence of tokens is used to predict the next token (thus looking only at the past, i.e. the left context). It uses masked (causal) attention to bring this auto-regressive behaviour into a transformer-based model.
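And a quick sketch of the auto-regressive objective using the standard text-generation pipeline (the prompt and generation settings are just for illustration):

```python
from transformers import pipeline

# GPT-2 predicts the next token given only the left context, one token at a time.
generator = pipeline("text-generation", model="gpt2")
out = generator("The capital of France is", max_new_tokens=5, do_sample=False)
print(out[0]["generated_text"])
```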
Thus it is not just about the pretrained parameters, but about the models themselves and their training objectives.
Upvotes: 2