Reputation: 59
I have a list of texts and I need to send each text to a large language model (Llama-2-7b). However, I am getting a CUDA out-of-memory error. I am running on an A100 on Google Colab. Here is my attempt:
import pandas as pd
from transformers import LlamaTokenizer, LlamaForCausalLM

path = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = LlamaTokenizer.from_pretrained(path)
model = LlamaForCausalLM.from_pretrained(path).to("cuda")

def interact_with_model(query, input_text=""):
    if pd.isna(input_text):
        return "NaN"
    prompt = query + input_text
    inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False).to("cuda")
    generate_ids = model.generate(**inputs)
    output = tokenizer.batch_decode(generate_ids)[0]  # decode the generated ids
    return output
def process_data(query, batch):
    responses = []
    for i in range(len(batch)):
        response = interact_with_model(query, batch[i])
        responses.append(response)
    return responses
query_1 = "Summarize the following text"
responses_1 = []
for i in range(0, len(inputs), 50):  # inputs here is the list of texts
    sub_inputs = inputs[i:i+50]
    responses_1.append(process_data(query_1, batch=sub_inputs))
I tried to send the texts to the model 50 samples at a time, but this didn't work either. Where is the issue?
Upvotes: 0
Views: 1159
Reputation: 5383
Your GPU doesn't have enough memory for the size of the inputs you are using. Reduce the batch size to 1 and the generation length to 1 token, check the memory usage, and then increase from there to see what the limits are on your GPU.
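As a minimal sketch of that procedure (assuming the same model path and tokenizer as in the question), you could generate a single token for one prompt, print the reported GPU memory, and only then raise max_new_tokens and the number of prompts step by step:

import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

path = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = LlamaTokenizer.from_pretrained(path)
model = LlamaForCausalLM.from_pretrained(path).to("cuda")

# One short prompt to start with; scale up after checking memory
prompt = "Summarize the following text: ..."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():  # no gradients are needed for inference
    generate_ids = model.generate(**inputs, max_new_tokens=1)  # start with 1 generated token

print(tokenizer.batch_decode(generate_ids, skip_special_tokens=True)[0])
print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")

If the full-precision weights alone already take most of the card, loading the model with torch_dtype=torch.float16 roughly halves the weight memory, and passing max_new_tokens to generate keeps the output length bounded instead of relying on the default.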
Upvotes: 0