Reputation: 365
I've been comparing various LangChain-compatible Llama 2 runtimes using a LangChain LLM chain, with the following parameter overrides:
# llama.cpp:
model_path="../llama.cpp/models/generated/codellama-instruct-7b.ggufv3.Q5_K_M.bin",
n_ctx = 2048,
max_tokens = 2048,
temperature = 0.85,
top_k = 40,
top_p = 0.95,
repeat_penalty = 1.1,
seed = 112358,
# ctransformers:
model="../llama.cpp/models/generated/codellama-instruct-7b.ggufv3.Q5_K_M.bin",
config={
"context_length": 2048,
"max_new_tokens": 2048,
"temperature": 0.85,
"top_k": 40,
"top_p": 0.95,
"repetition_penalty" :1.1,
"seed" : 112358
},
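For context, the two LLMs are constructed roughly like this (a minimal sketch rather than the exact code; the classes come from langchain_community.llms here, in older LangChain versions they live under langchain.llms):
from langchain_community.llms import CTransformers, LlamaCpp

# llama.cpp runtime (via the llama-cpp-python bindings)
llm_llamacpp = LlamaCpp(
    model_path="../llama.cpp/models/generated/codellama-instruct-7b.ggufv3.Q5_K_M.bin",
    n_ctx=2048,
    max_tokens=2048,
    temperature=0.85,
    top_k=40,
    top_p=0.95,
    repeat_penalty=1.1,
    seed=112358,
)

# ctransformers runtime, pointed at the same GGUF file with equivalent settings
llm_ctransformers = CTransformers(
    model="../llama.cpp/models/generated/codellama-instruct-7b.ggufv3.Q5_K_M.bin",
    model_type="llama",
    config={
        "context_length": 2048,
        "max_new_tokens": 2048,
        "temperature": 0.85,
        "top_k": 40,
        "top_p": 0.95,
        "repetition_penalty": 1.1,
        "seed": 112358,
    },
)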
The model is derived from the original codellama-7b-instruct, using the conversion methods suggested for llama.cpp.
The system and user prompts are the same for both runtimes, and the prompt template is the one from the CodeLlama paper.
template = """<s>[INST] <<SYS>>
{system}
<</SYS>>
{user} [/INST]"""
system = """You are very helpful coding assistant who can write complete and correct programs in various programming languages, expecially in java and scala."""
The ctransformers-based completion is adequate, but the llama.cpp completion is qualitatively bad: often incomplete, repetitive, and sometimes stuck in a repeat loop.
Apart from the overrides, I have verified that the defaults are, as far as I can tell, the same for both implementations.
What else can I check to bring llama.cpp to the same behaviour, since llama.cpp is the runtime I'm more interested in using?
Upvotes: 1
Views: 1168
Reputation: 1
Firstly, by nature LLMs are non-deterministic: what they predict is a stochastic (statistical) process. How stochastic it is, is influenced for example by the "temperature" setting. In simplified terms: when the machine tries to predict the next token (e.g. the next word in the answer it gives you), it selects one option out of many possible ones, and how much leeway it has is affected by the temperature. Say there are 100 possible next words in a list ordered by the likelihood of being the correct one. With a temperature of 0, the most likely one is picked; if you repeat the same prompt, you are likely to get a very similar and occasionally identical answer. With a higher temperature, one of a range of candidates is picked, the range widening as the temperature increases. If you want repeatability, a low temperature gives you "better" answers; if you want creativity, a higher temperature will.
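A minimal sketch of the idea (toy logits and plain NumPy, not the actual llama.cpp sampler, which also applies top_k/top_p filtering and the repeat penalty):
import numpy as np

def sample_next_token(logits, temperature, rng):
    # temperature near 0 collapses the distribution onto the most likely token;
    # higher temperature flattens it, so less likely tokens get picked more often
    if temperature <= 1e-6:
        return int(np.argmax(logits))
    scaled = np.asarray(logits) / temperature
    probs = np.exp(scaled - np.max(scaled))  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(112358)          # fixed seed -> reproducible draws
logits = [4.0, 3.5, 1.0, 0.2]                # toy scores for 4 candidate tokens
print([sample_next_token(logits, 0.1, rng) for _ in range(5)])  # almost always token 0
print([sample_next_token(logits, 1.5, rng) for _ in range(5)])  # noticeably more varied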
llama.cpp usually expects you to provide the correct prompt template, whereas some other apps will automatically select the most appropriate prompt template for you. How well an LLM responds will very much depend on selecting the correct prompt template; this might be your main problem. For each model you use, check the required prompt template on the Hugging Face model card, e.g. https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
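If the model repo on the hub ships a chat template, you can also let the tokenizer build the prompt instead of hand-writing it (just a sketch; it assumes the transformers library is installed and that the repo actually defines a chat template, otherwise apply_chat_template will complain):
from transformers import AutoTokenizer

# codellama/CodeLlama-7b-Instruct-hf is the original (non-quantized) repo for the
# model in the question; only the tokenizer files are downloaded here
tok = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")

messages = [
    {"role": "system", "content": "You are a very helpful coding assistant."},
    {"role": "user", "content": "Write a Scala function that reverses a string."},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # compare this against the hand-written <s>[INST] <<SYS>> ... template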
Upvotes: 0
Reputation: 6791
One situation where fixed seeds still produce different answers is when the repeated prompts and answers get included in the context window. For LLMs like Mixtral-8x7B-Instruct-v0.1, it's sufficient to issue the same prompt three times: by the third prompt, the context window contains three identical prompts and two identical answers, and the answer comes out slightly different... despite a fixed seed that otherwise works correctly (i.e. when the prompt is unaffected by the conversation history).
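In other words, a fixed seed only pins down the sampling for one exact input; once the conversation history is prepended, the input itself is different on every call. A rough sketch of the effect (the helper below is purely illustrative, not any framework's API):
history = []

def effective_prompt(user_msg):
    # every earlier prompt/answer pair becomes part of the next input, so the
    # third call sees a longer (and therefore different) context than the first
    turns = "".join(f"[INST] {p} [/INST] {a} " for p, a in history)
    return turns + f"[INST] {user_msg} [/INST]"

question = "Write a Java method that reverses a string."
for i in range(3):
    prompt = effective_prompt(question)
    print(f"call {i + 1}: {len(prompt)} characters of input")
    answer = f"<answer {i + 1}>"  # stand-in for llm(prompt) with a fixed seed
    history.append((question, answer))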
Upvotes: 0
Reputation: 6791
Fixing the seeds in both frameworks should be sufficient to give reproducible results regardless of other inference parameters, but I noticed another problem with this experiment: these temperature and top_k settings are not really useful for the task of code generation; in fact, such a wide-ranging distribution should probably be avoided even if the most diverse and creative outputs were expected. For instance, on the OpenAI forums you can find advice that for code generation temperature should be set to 0.2 and top_p to 0.1 (see this post). The highest recommended settings (for Creative Writing) are 0.7 for temperature and 0.8 for top_p. See if making these changes makes any difference to your A/B test.
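Concretely, on the llama.cpp side that would look something like this (a sketch of the suggested overrides only; the other parameters are kept as in the question):
from langchain_community.llms import LlamaCpp

# low temperature and tight top_p, as recommended for code generation
llm_llamacpp = LlamaCpp(
    model_path="../llama.cpp/models/generated/codellama-instruct-7b.ggufv3.Q5_K_M.bin",
    n_ctx=2048,
    max_tokens=2048,
    temperature=0.2,
    top_p=0.1,
    top_k=40,          # unchanged from the question
    repeat_penalty=1.1,
    seed=112358,
)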
Upvotes: 1