Reputation: 31
I'm trying to fine-tune a base model on my own data using prompts; however, I keep getting this error:
OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 79.25 GiB of which 177.50 MiB is free. Process 230359 has 79.07 GiB memory in use. Of the allocated memory 78.24 GiB is allocated by PyTorch, and 28.80 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
This happens despite working on a server with 2x 80 GB GPUs. Any idea what I'm doing wrong? This is not a huge base model, and I'm only trying to train it on 900 samples.
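The error message itself points at PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True; a minimal sketch of how I understand that setting would be applied (an assumption on my part, not something already in my script), in case it matters:

import os

# Suggested by the OOM message to reduce fragmentation; it has to be set
# before CUDA is initialized, i.e. before any tensor touches the GPU.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after setting the env var so the allocator picks it up

This is my training code: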
import pandas as pd
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
)

# Base model and tokenizer
model = AutoModelForCausalLM.from_pretrained('dicta-il/dictalm2.0')
tokenizer = AutoTokenizer.from_pretrained('dicta-il/dictalm2.0')

# df is my dataframe of prompts; keep only the Prompt column and sample 1000 rows
df = df[['Prompt']]
df = df.sample(1000).reset_index(drop=True)

# 90/10 train/validation split
train_df, val_df = train_test_split(df, test_size=0.1, random_state=42)
dataset = DatasetDict({
    "train": Dataset.from_pandas(train_df),
    "validation": Dataset.from_pandas(val_df),
})
# Drop the index column that from_pandas adds
dataset = dataset.map(remove_columns=["__index_level_0__"])

# Tokenize the prompts
def tokenize_function(examples):
    return tokenizer(examples["Prompt"], max_length=512)

tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=['Prompt'],
)
# Add a labels column (one label per input token)
tokenized_dataset = tokenized_dataset.map(
    lambda example: {"labels": [1] * len(example["input_ids"])}
)

# Pad with the EOS token; the collator pads inputs and labels per batch
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, label_pad_token_id=tokenizer.pad_token_id)

args = TrainingArguments(
    'results',
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
)
trainer.train()
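For reference, a small diagnostic sketch (not part of the training script) showing what PyTorch reports for the two cards, since the error only mentions GPU 0:

import torch

# Print every visible GPU with its total and currently allocated memory,
# to confirm both 80 GB cards are actually visible to this process.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    total = props.total_memory / 1024**3
    used = torch.cuda.memory_allocated(i) / 1024**3
    print(f"GPU {i}: {props.name}, {total:.1f} GiB total, {used:.1f} GiB allocated")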
Upvotes: 0
Views: 33