Reputation: 1
I am using the Mistral 7B Instruct model with llama-index, loading it with LlamaCPP (llama-cpp-python). When I try to run multiple inputs or prompts at the same time (opening 2 websites and sending 2 prompts), it gives me this error:
**GGML_ASSERT: D:\a\llama-cpp-python\llama-cpp-python\vendor\llama.cpp\ggml-backend.c:314: ggml_are_same_layout(src, dst) && "cannot copy tensors with different layouts"**
I tried the code below to check, and it reports that the layouts are the same:
```python
import numpy as np

def same_layout(tensor1, tensor2):
    # Compare the C- and Fortran-contiguity flags of both arrays
    return (tensor1.flags.f_contiguous == tensor2.flags.f_contiguous
            and tensor1.flags.c_contiguous == tensor2.flags.c_contiguous)

tensor_a = np.random.rand(3, 4)  # Creating a tensor
tensor_b = np.random.rand(3, 4)  # Creating another tensor
print(same_layout(tensor_a, tensor_b))
```
And this is how I load my model:
```python
llm = LlamaCPP(
    # model_url='https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf',
    model_path="C:/Users/ASUS608/AppData/Local/llama_index/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    temperature=0.3,
    max_new_tokens=512,
    context_window=4096,
    generate_kwargs={},
    model_kwargs={"n_gpu_layers": 25},
    messages_to_prompt=messages_to_prompt,
    # completion_to_prompt=completion_to_prompt,
    verbose=True,
)
```
What is happening here?
**Update:** the next error is:
**GGML_ASSERT: D:\a\llama-cpp-python\llama-cpp-python\vendor\llama.cpp\ggml-cuda.cu:352: ptr == (void *) (pool_addr + pool_used)**
Upvotes: 0
Views: 409
Reputation: 11
This error suggests that your system is running out of memory. I once tried to run a 30B model on my M1 Pro MacBook with 16 GB of RAM, and llama.cpp raised the same error.
Even though you are able to load the model, if it receives two or more queries during inference, it cannot handle those requests because it hits the memory limit.
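If memory is the bottleneck, one thing you could try is shrinking the model's footprint at load time. A minimal sketch reusing the parameters from your question; the reduced values are illustrative guesses, not tested settings:

```python
# Import path may differ depending on your llama-index version.
from llama_index.llms.llama_cpp import LlamaCPP

# Illustrative values only: offload fewer layers to the GPU and use a
# smaller context window to lower peak memory during inference.
llm = LlamaCPP(
    model_path="C:/Users/ASUS608/AppData/Local/llama_index/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    temperature=0.3,
    max_new_tokens=512,
    context_window=2048,                # half of the original 4096
    model_kwargs={"n_gpu_layers": 10},  # fewer than the original 25
    verbose=True,
)
```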
Another possibility is that a single LLM instance simply cannot handle concurrent requests. In that case, queue your requests and process them one by one, for example with the sketch below.
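A minimal sketch of serializing requests with a lock, assuming `llm` is the `LlamaCPP` instance you already loaded; `handle_prompt` is a hypothetical stand-in for your web framework's request handler:

```python
import threading

# One lock guards the single LlamaCPP instance, so only one
# inference runs at a time; concurrent requests wait their turn.
llm_lock = threading.Lock()

def handle_prompt(prompt: str) -> str:
    # `handle_prompt` is a placeholder for your real request handler;
    # `llm` is assumed to be the LlamaCPP object from the question.
    with llm_lock:
        response = llm.complete(prompt)  # blocks until this completion finishes
    return str(response)
```

The lock acts as the simplest possible queue: waiting threads line up on it, so the model only ever sees one request at a time.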
Upvotes: 0