Reputation: 14316
If you use the ChatGPT web app, it types the answer token by token. If you use it through the API, you get the whole answer at once.
My assumption was that they provide token-by-token answers in the web app for UX reasons (easier reading maybe, or a sneaky way to limit the number of user prompts by making people wait longer for each answer).
Today I downloaded the llama.cpp app and played around with models from Hugging Face.
What made me wonder was that the llama CLI was also printing the answers token by token. While it is typing, it uses ~70% of my CPU; the moment it stops typing, the CPU usage drops to 0%. If the output is long, the CPU stays at 70% for longer.
It looks like the answer tokens are actually pulled from the model one by one and the more tokens you want, the longer it takes to generate.
However, my initial understanding was that a model always returns answers of the same length (just zero-padded if a shorter answer makes more sense). I also assumed that the model's response time is invariant to the length of the prompt and the generated output.
What am I missing? How does it really work?
Upvotes: 3
Views: 1912
Reputation: 11
LLMs do generate answers token by token. The model takes the user input plus the part of the answer generated so far as context and predicts the next token. Therefore it is not possible to produce the whole answer at once: to generate a token, the model first needs the preceding part of the answer as context. Each new token costs one forward pass through the model, which is why your CPU stays busy for as long as the output keeps growing.
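Here is a minimal sketch of that autoregressive loop using the Hugging Face transformers library (the "gpt2" model is just a small example choice, and greedy decoding is used for simplicity; real chat models sample and use more elaborate stopping rules):

```python
# Sketch of token-by-token (autoregressive) generation.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

for _ in range(20):                                  # generate at most 20 new tokens
    with torch.no_grad():
        logits = model(input_ids).logits             # one full forward pass per new token
    next_id = logits[0, -1].argmax()                 # greedy pick of the most likely next token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)  # feed it back in as context
    print(tokenizer.decode(next_id.item()), end="", flush=True)    # tokens appear one by one
    if next_id.item() == tokenizer.eos_token_id:     # stop when the model emits end-of-sequence
        break
```

The loop body runs once per output token, so the compute cost grows with the length of the answer. That also explains why the output length is not fixed or padded: generation simply stops when an end-of-sequence token is produced or a token limit is reached. Streaming the tokens to the user (web app, llama CLI) just displays them as soon as they are produced; a non-streaming API call runs the same loop and only returns once it has finished.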
Upvotes: 1