Reputation: 43
I have read lots of articles saying that BERT is good for NLU while GPT is good for NLG. But the key structural difference between them seems to be just whether a mask is added in self-attention, plus the fact that the models are trained in different ways.
From the code below, if I understand correctly, we are free to choose whether or not to add an attention mask: https://github.com/huggingface/transformers/blob/master/src/transformers/models/bert/modeling_bert.py https://github.com/huggingface/transformers/blob/master/src/transformers/models/gpt2/modeling_gpt2.py
So can I conclude that it should really be "the pretrained parameters of BERT are good for NLU" and "the pretrained parameters of GPT-2 are good for NLG"? Or is there some other critical difference between the two that makes people draw the conclusion I mentioned at the beginning?
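To make sure I understand the masking difference I mean, here is a minimal sketch (not the actual Hugging Face code; `scores` is just an illustrative tensor for one attention head):

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)  # raw attention scores for one head

# BERT-style (bidirectional): every token may attend to every other token.
bert_attn = torch.softmax(scores, dim=-1)

# GPT-style (causal): a lower-triangular mask blocks attention to future tokens.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
gpt_attn = torch.softmax(scores.masked_fill(~causal_mask, float("-inf")), dim=-1)

print(bert_attn[0])  # position 0 attends to all 5 positions
print(gpt_attn[0])   # position 0 attends only to itself
```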
Upvotes: 1
Views: 1760
Reputation: 7379
BERT and GPT are trained with different objectives and for different purposes.
BERT is trained as an auto-encoder. Masked Language Modelling (MLM) corrupts the input by masking some tokens, and the model's objective is to recover those masked tokens. It uses bidirectional self-attention, where each token in the input sentence attends to both its left and right context.
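For example, a quick illustration of the MLM objective using the standard transformers fill-mask pipeline (the example sentence is just for illustration):

```python
from transformers import pipeline

# BERT predicts the [MASK] token using context on BOTH sides of it.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The capital of France is [MASK]."):
    print(pred["token_str"], pred["score"])
```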
In contrast, GPT is trained as an auto-regressive model. It is trained with a language-modelling objective, where the given sequence of tokens is used to predict the next token (thus looking only at the past, i.e. the left context). It uses masked (causal) attention to bring this auto-regressive behaviour into a transformer-based model.
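And a quick sketch of the auto-regressive objective using the standard text-generation pipeline (the prompt and generation settings are just for illustration):

```python
from transformers import pipeline

# GPT-2 predicts the next token given only the left context, one token at a time.
generator = pipeline("text-generation", model="gpt2")
out = generator("The capital of France is", max_new_tokens=5, do_sample=False)
print(out[0]["generated_text"])
```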
Thus it is not just about the pretrained parameters, but about the models themselves and their training objectives.
Upvotes: 2