Tolu

Reputation: 1147

Fine-tuning an LM vs. prompt-engineering an LLM

Is it possible to fine-tune a much smaller language model like RoBERTa on, say, a customer service dataset and get results as good as one might get by prompting GPT-4 with parts of the dataset?

Can a fine-tuned RoBERTa model learn to follow instructions in a conversational manner, at least for a small domain like this?

Is there any paper or article that explores this issue empirically that I can check out?

Upvotes: 3

Views: 2739

Answers (2)

Nathaniel Mahowald

Reputation: 1

I found a piece which offers another perspective here. Certainly, in straightforward performance testing there's a tradeoff. But another factor to consider is whether all your test cases fall within, or very close to, the original data, and to what extent you might later want to adjust model behavior as your use case evolves. Fine-tuning is much more rigid when you need to change behavior based on things you discover along the way, and it sometimes doesn't adapt well to unexpected situations.

Upvotes: 0

Tolu

Reputation: 1147

I found a Medium piece which goes a long way toward clarifying this here.

Quoting from the conclusion of the above:

In the low data domain, prompting shows superior performance to the respective fine-tuning method. To beat the SOTA benchmarks in fine-tuning, leveraging large frozen language models in combination with tuning a soft prompt seems to be the way forward.

It appears prompting an LLM may outperform fine-tuning a smaller model on domain-specific tasks when the training data is small, and the reverse when the data is plentiful.

Additionally, in my own personal, anecdotal experience with ChatGPT, Bard, Bing, Vicuna-3b, Dolly-v2-12b and LLaMA-13b, it appears models of the size of ChatGPT, Bard and Bing have learned to mimic human understanding of language well enough to extract meaningful answers from context provided at inference time. The smaller models do not seem to have that mimicry-mastery and might not perform as well with in-context learning at inference time. Yet these medium-sized models (the 12-13B ones) might also be too large to be well suited for fine-tuning in a very limited domain. My hunch is that for very limited domains, if one is going the fine-tuning route, fine-tuning much smaller models like BERT or RoBERTa (or smaller variants of GPT-2 or GPT-J, for generative tasks) rather than these medium-sized models might be the more prudent approach resource-wise.

An alternative to fine-tuning the smaller models on domain data could be to use more carefully and rigorously crafted prompts with the medium-sized models. This could be a viable alternative to using the APIs provided by the owners of the very large proprietary models.
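To make the prompting route concrete, here is a minimal sketch of assembling a few-shot prompt from labelled domain examples, which is what "prompting with parts of the dataset" amounts to in practice. The dataset rows, labels and prompt wording are all illustrative assumptions, not taken from any real dataset; the resulting string would be sent to whichever instruction-following model you use.

```python
# Sketch: build a few-shot, in-context-learning prompt from a toy
# customer-service dataset. All example data here is made up.

EXAMPLES = [
    {"query": "My order arrived damaged.",
     "reply": "Sorry to hear that! Please share your order number and "
              "we will arrange a replacement."},
    {"query": "How do I reset my password?",
     "reply": "Click 'Forgot password' on the login page and follow "
              "the emailed link."},
]

def build_few_shot_prompt(examples, new_query):
    """Concatenate an instruction, labelled examples, and the new query
    into a single prompt string."""
    parts = ["You are a helpful customer-service agent. "
             "Answer in the same style as the examples below."]
    for ex in examples:
        parts.append(f"Customer: {ex['query']}\nAgent: {ex['reply']}")
    # Leave the final 'Agent:' turn open for the model to complete.
    parts.append(f"Customer: Where is my refund?\nAgent:"
                 if new_query is None else
                 f"Customer: {new_query}\nAgent:")
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(EXAMPLES, "Where is my refund?")
print(prompt)
```

The tradeoff discussed above then becomes a question of how many such examples fit in the model's context window versus how many you would need to fine-tune on.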

Upvotes: 7
