HelloALive

Reputation: 1

Unable to send multiple inputs using LlamaCPP and llama-index

I am using the Mistral 7B-instruct model with llama-index, loading it through LlamaCPP. When I try to run multiple inputs or prompts at the same time (opening two browser tabs and sending two prompts), it gives me this error: **GGML_ASSERT: D:\a\llama-cpp-python\llama-cpp-python\vendor\llama.cpp\ggml-backend.c:314: ggml_are_same_layout(src, dst) && "cannot copy tensors with different layouts"**

I tried the following code to check, and it reports that the layouts are the same:

import numpy as np

def same_layout(tensor1, tensor2):
    # Compare the memory-layout flags of the two arrays
    return (tensor1.flags.f_contiguous == tensor2.flags.f_contiguous
            and tensor1.flags.c_contiguous == tensor2.flags.c_contiguous)

tensor_a = np.random.rand(3, 4)  # Creating a tensor
tensor_b = np.random.rand(3, 4)  # Creating another tensor
print(same_layout(tensor_a, tensor_b))  # True

and this is how I load my model:

llm = LlamaCPP(
    # model_url='https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf',
    model_path="C:/Users/ASUS608/AppData/Local/llama_index/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    temperature=0.3,
    max_new_tokens=512,
    context_window=4096,
    generate_kwargs={},
    model_kwargs={"n_gpu_layers": 25},
    messages_to_prompt=messages_to_prompt,
    # completion_to_prompt=completion_to_prompt,
    verbose=True,
)

What is happening?

*Update:* the next error I get is **GGML_ASSERT: D:\a\llama-cpp-python\llama-cpp-python\vendor\llama.cpp\ggml-cuda.cu:352: ptr == (void *) (pool_addr + pool_used)**

Upvotes: 0

Views: 409

Answers (1)

Raj Singh Parihar

Reputation: 11

This error suggests that your system is possibly out of memory. I once tried to run a 30B model on my M1 Pro MacBook with 16 GB of RAM, and llama.cpp raised the same error.

Even though you are able to load the model, during inference, if it receives two or more queries at once, it cannot handle those requests because it hits a memory limit. You could try reducing the model's memory footprint, for example as in the sketch below.
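
This is only a sketch reusing the loader from your question; the exact numbers are guesses you would need to tune for your GPU. Offloading fewer layers and shrinking the context window both lower memory pressure:

llm = LlamaCPP(
    model_path="C:/Users/ASUS608/AppData/Local/llama_index/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    temperature=0.3,
    max_new_tokens=512,
    context_window=2048,                # smaller context -> smaller KV cache
    model_kwargs={"n_gpu_layers": 10},  # offload fewer layers to the GPU
    messages_to_prompt=messages_to_prompt,
    verbose=True,
)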

Another possibility is that a single instance of an LLM cannot handle multiple concurrent requests. In that case, create a queue for your requests and process them one by one, as in the sketch below.
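
For example, here is a minimal sketch (assuming `llm` is the LlamaCPP instance from your question; `complete_serialized` is just a name I made up) that uses a lock so only one request ever reaches the model at a time:

import threading

_llm_lock = threading.Lock()

def complete_serialized(prompt: str) -> str:
    # Block until any in-flight request has finished, then run ours;
    # the model therefore never sees two prompts at once.
    with _llm_lock:
        return llm.complete(prompt).text

Each web request would then call `complete_serialized` instead of hitting `llm` directly, so concurrent prompts are processed strictly one after another.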

Upvotes: 0
