Reputation: 9
Is fine-tuning a pre-trained transformer model (BERT, GPT-2) an ‘easier’ task than training a transformer from scratch, in terms of GPU needs and GPU memory usage?
To clarify further: I’ve read that to train most transformer models, one would require multi-GPU training. However, is it possible to fine-tune some of these models on a single GPU? If so, why is that the case?
Is it because we can use smaller batches, so the fine-tuning time is not as long as training from scratch?
Upvotes: 0
Views: 1268
Reputation: 191
One reason why fine-tuning a pre-trained model might demand more resources than training a transformer from scratch is that you have no control over the size of the pre-trained model. Although such models usually come in multiple sizes (small, base, large, ...), you might be able to obtain comparable results on a specific task by training a smaller model from scratch.
Upvotes: 0
Reputation: 16690
Yes, fine-tuning a pre-trained Transformer model is the typical way to go. The training time required to pre-train from scratch is prohibitively large (up to hundreds of thousands of GPU-hours on a decent card per model), while fine-tuning can be done on a single GPU. The reason is that fine-tuning often entails training only a few layers on top of the pre-trained model's output to tailor it to a given task. As such, fine-tuning requires less data and significantly less training time to achieve good results.
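To illustrate the point about training only a few layers: here is a minimal PyTorch sketch of freezing a pre-trained "body" and training only a small task head. The layer sizes are toy stand-ins (not a real BERT/GPT-2 architecture), and in practice the encoder weights would be loaded from a checkpoint rather than randomly initialized.

```python
import torch
from torch import nn

class FineTuneModel(nn.Module):
    def __init__(self, hidden=768, num_labels=2):
        super().__init__()
        # Stand-in for a large pre-trained encoder (would be loaded, not random).
        self.encoder = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Small task-specific head that is actually trained during fine-tuning.
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, x):
        return self.classifier(self.encoder(x))

model = FineTuneModel()

# Freeze the encoder: no gradients are computed or stored for it, which cuts
# both backward-pass compute and optimizer-state GPU memory.
for p in model.encoder.parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable} / total: {total}")

# The optimizer only needs to track the head's parameters.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-5
)
```

In this toy case only ~0.1% of the parameters remain trainable; with a real pre-trained checkpoint the same freezing pattern is what makes single-GPU fine-tuning feasible.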
Upvotes: 0