grey

Reputation: 59

OutOfMemoryError: CUDA out of memory in LLM

I have a list of texts and I need to send each text to a large language model (Llama-2-7b). However, I am getting a CUDA out of memory error. I am running on an A100 on Google Colab. Here is my attempt:

import pandas as pd
from transformers import LlamaTokenizer, LlamaForCausalLM

path = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = LlamaTokenizer.from_pretrained(path)
model = LlamaForCausalLM.from_pretrained(path).to("cuda")


def interact_with_model(query, input_text=""):
    if pd.isna(input_text):
        return "NaN"
    else:
        prompt = query + input_text
        inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False).to("cuda")
        generate_ids = model.generate(**inputs)
        output = tokenizer.batch_decode(generate_ids)[0]
        return output


def process_data(query, batch):
    responses = []
    for i in range(len(batch)):
        response = interact_with_model(query, batch[i])
        responses.append(response)
    return responses



query_1 = "Summarize the following text"
responses_1 = []
# inputs is my list of texts
for i in range(0, len(inputs), 50):
    sub_inputs = inputs[i:i+50]
    responses_1.append(process_data(query_1, batch=sub_inputs))

I tried to send the texts to the model in batches of 50 samples at a time, but that didn't work either. Where is the issue?

Upvotes: 0

Views: 1159

Answers (1)

Karl

Reputation: 5383

Your GPU doesn't have enough memory for the size of the inputs you are using. Reduce the batch size to 1 and the generation length to 1 token. Check memory usage, then increase from there to see what the limits are on your GPU.
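As a minimal sketch of how to probe that limit, reusing the tokenizer, model and prompt style from your snippet (the sample text below is just a placeholder), you can generate a single token and read the peak memory counter, then raise max_new_tokens step by step:

import torch

# Placeholder prompt; substitute one of your real texts here.
prompt = "Summarize the following text" + "some short sample text"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    # Start with one generated token, then increase max_new_tokens gradually.
    generate_ids = model.generate(**inputs, max_new_tokens=1)

print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")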

Upvotes: 0
