Reputation: 9
Is fine-tuning a pre-trained transformer model (BERT, GPT-2) an ‘easier’ task than training a transformer from scratch, in terms of GPU needs and GPU memory usage?
To clarify further: I’ve read that to train most transformer models, one would require multi-GPU training. However, is it possible to fine-tune some of these models on a single GPU? If so, why is that the case?
Is it because we can use smaller batches, so the fine-tuning time is not as long as training from scratch?
Upvotes: 0
Views: 1268
Reputation: 191
One reason why fine-tuning a pre-trained model might demand more resources than training a transformer from scratch is that you have no control over the size of the pre-trained model. Although such models usually come in multiple sizes (small, base, large, ...), you might be able to obtain comparable results on a specific task by training a smaller model from scratch.
Upvotes: 0
Reputation: 16690
Yes, fine-tuning a pre-trained Transformer model is the typical way to go. The training time required to pre-train from scratch is prohibitively large (up to hundreds of thousands of GPU-hours on a decent card per model), while fine-tuning can be done on a single GPU. The reason is that fine-tuning often entails training only a few layers on top of the pre-trained model's output to tailor it to a given task. As such, fine-tuning requires less data and significantly less training time to achieve good results.
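To illustrate the point about training only a few layers: here is a minimal PyTorch sketch of freezing a pre-trained "body" and training only a small task head. The layer sizes are toy stand-ins (not a real BERT/GPT-2 architecture), and in practice the encoder weights would be loaded from a checkpoint rather than randomly initialized.

```python
import torch
from torch import nn

class FineTuneModel(nn.Module):
    def __init__(self, hidden=768, num_labels=2):
        super().__init__()
        # Stand-in for a large pre-trained encoder (would be loaded, not random).
        self.encoder = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Small task-specific head that is actually trained during fine-tuning.
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, x):
        return self.classifier(self.encoder(x))

model = FineTuneModel()

# Freeze the encoder: no gradients are computed or stored for it, which cuts
# both backward-pass compute and optimizer-state GPU memory.
for p in model.encoder.parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable} / total: {total}")

# The optimizer only needs to track the head's parameters.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-5
)
```

In this toy case only ~0.1% of the parameters remain trainable; with a real pre-trained checkpoint the same freezing pattern is what makes single-GPU fine-tuning feasible.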
Upvotes: 0