Reputation: 11
Could I ask you for help? I am fine-tuning Llama 3 8B (with LoRA) for text classification, using the Trainer from Hugging Face. I am looking for the optimal `learning_rate` and `per_device_train_batch_size`. I am using the AdamW_torch optimizer and have over 80,000 training examples. With `per_device_train_batch_size = 2`, one epoch takes about 12 hours, so it is hard to find optimal values because training is so long. My best result so far is with batch size 2, 1 epoch and learning rate 2e-4 --> 86.86% on validation data (about the same on test data). More epochs don't help; the model starts to overfit with this LR and batch size.
I am posting my intermediate results below. Because of the long training time, some runs are incomplete (e.g. stopped after 1/3 of the epochs). Could someone give me an estimate, based on these, of what batch size and learning rate might help and what I should use for a longer training run?
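For context, this is roughly my setup (a simplified sketch; the dataset variables, `compute_metrics` function, number of labels and LoRA values are placeholders, not my exact configuration):

```python
# Simplified sketch of my setup; dataset variables, num_labels and LoRA values are placeholders.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from peft import LoraConfig, TaskType, get_peft_model

model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=10)  # placeholder label count

# Illustrative LoRA settings, not necessarily the ones I use
lora_config = LoraConfig(task_type=TaskType.SEQ_CLS, r=16, lora_alpha=32, lora_dropout=0.05)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="llama3_cls",
    learning_rate=2e-4,              # one of the two values I am tuning
    per_device_train_batch_size=2,   # the other value I am tuning
    num_train_epochs=1,
    optim="adamw_torch",
    eval_strategy="steps",           # evaluation_strategy on older transformers versions
    eval_steps=20786,                # roughly every half epoch in my logs
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,          # my ~80k tokenized training examples
    eval_dataset=val_ds,             # my validation split
    tokenizer=tokenizer,
    compute_metrics=compute_metrics, # accuracy, as in the logs below
)
trainer.train()
```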
```
**1 EPOCH, LR 2e-4, PER_DEVICE_TRAIN_BATCH_SIZE = 2**
Epoch 0.5: {'eval_loss': 0.8900082111358643, 'eval_accuracy': 0.8452}
Epoch 1.0: {'eval_loss': 0.6152949929237366, 'eval_accuracy': 0.8686}
**2 EPOCHS, LR 2e-4, PER_DEVICE_TRAIN_BATCH_SIZE = 2**
Epoch 0.5: {'eval_loss': 0.9081119894981384, 'eval_accuracy': 0.8314}
Epoch 1.0: {'eval_loss': 0.7357500195503235, 'eval_accuracy': 0.8488}
Epoch 1.5: {'eval_loss': 0.7343335151672363, 'eval_accuracy': 0.8584}
Epoch 2.0: {'eval_loss': 0.5892598628997803, 'eval_accuracy': 0.8758} -> 87.1% on test data
**6 EPOCHS, LR 2e-4, PER_DEVICE_TRAIN_BATCH_SIZE = 2 (stopped after 5 epochs)**
Step Training Loss Validation Loss Accuracy
41572 0.811400 0.872795 0.834000
83144 0.788500 0.774350 0.836600
124716 0.777800 0.737788 0.845200
166288 0.547900 0.732708 0.858400
207860 0.416800 0.746941 0.866000
**5 EPOCHS, LR 1e-5, PER_DEVICE_TRAIN_BATCH_SIZE = 2
[ 84599/207860 26:28:07 < 38:33:56, 0.89 it/s, Epoch 2.03/5]**
Step Training Loss Validation Loss Accuracy
20786 0.892400 0.824404 0.839000
41572 0.773900 0.792925 0.848000
62358 0.699800 0.842472 0.851800
83144 0.740500 0.713346 0.859400
**3 EPOCHS, LR 2e-5, PER_DEVICE_TRAIN_BATCH_SIZE = 16
[13922/15591 12:29:10 < 1:29:49, 0.31 it/s, Epoch 2.68/3]**
Step Training Loss Validation Loss Accuracy
2598 1.575400 1.473688 0.540800
5196 1.209300 1.213021 0.619800
7794 1.135000 1.119776 0.642600
10392 1.078100 1.078663 0.655800
12990 1.028700 1.062033 0.660600
**3 EPOCHS, LR 2e-5, PER_DEVICE_TRAIN_BATCH_SIZE = 32
[1307/7797 2:19:22 < 11:33:09, 0.16 it/s, Epoch 0.50/3]**
Step Training Loss Validation Loss Accuracy
1299 2.121100 1.649759 0.497800
**3 EPOCHS, LR 16e-5, PER_DEVICE_TRAIN_BATCH_SIZE = 16
[ 5222/15591 4:45:06 < 9:26:20, 0.31 it/s, Epoch 1.00/3]**
Step Training Loss Validation Loss Accuracy
2598 1.151000 1.053402 0.668400
5196 0.988100 1.030373 0.676200
**3 EPOCHS, LR 16e-4, PER_DEVICE_TRAIN_BATCH_SIZE = 16
[ 2608/15591 2:21:56 < 11:47:10, 0.31 it/s, Epoch 0.50/3]**
Step Training Loss Validation Loss Accuracy
2598 3.142300 2.734620 0.642000
**1 EPOCH, LR 2e-3, PER_DEVICE_TRAIN_BATCH_SIZE = 2
[31814/41572 8:05:48 < 2:29:00, 1.09 it/s, Epoch 0.77/1]**
Step Training Loss Validation Loss Accuracy
20786 0.000000 nan 0.079200 -> 7% accuracy :) :) :) :), LR is too high
**1 EPOCH, LR 4e-4, PER_DEVICE_TRAIN_BATCH_SIZE = 2**
Epoch 0.5: {'eval_loss': 1.260171890258789, 'eval_accuracy': 0.6548}
**1 EPOCH, LR 1e-4, PER_DEVICE_TRAIN_BATCH_SIZE = 2
[21434/41572 6:42:15 < 6:17:58, 0.89 it/s, Epoch 0.52/1]**
Step Training Loss Validation Loss Accuracy
20786 0.864800 0.804458 0.851000
**1 EPOCH, LR 5e-5, PER_DEVICE_TRAIN_BATCH_SIZE = 2**
Epoch 0.5: {'eval_loss': 0.7843595743179321, 'eval_accuracy': 0.8532}
Epoch 1.0: {'eval_loss': 0.6852059364318848, 'eval_accuracy': 0.8632}
```
My view: With batch size 2 and learning rate 2e-4 or smaller, I can't see much progress. The result after a full 1-epoch run is almost the same as after a full 5-epoch run (and after 1 epoch of a 5-epoch run it is worse than after a full 1-epoch run). If I choose, e.g., batch size 16, accuracy drops to around 60%, then rises quickly, but then stalls and only rises slowly. With a larger batch size I get better intermediate results if I also increase the learning rate. I've been scaling linearly, for example going from batch size 2 to 16 --> original LR * 8, as in the calculation below.
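To make the scaling I mean concrete (the linear rule, plus the square-root variant that is sometimes suggested for Adam-style optimizers; I don't know which, if either, is appropriate here):

```python
# Linear vs. square-root LR scaling when going from batch size 2 to 16.
base_lr, base_bs, new_bs = 2e-4, 2, 16

linear_lr = base_lr * (new_bs / base_bs)       # 2e-4 * 8     = 1.6e-3 (i.e. 16e-4, like my batch-16 run)
sqrt_lr = base_lr * (new_bs / base_bs) ** 0.5  # 2e-4 * ~2.83 ≈ 5.7e-4

print(f"linear: {linear_lr:.1e}, sqrt: {sqrt_lr:.1e}")
```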
Isn't this weird? If I choose 3 epochs, LR 16e-5, per_device_train_batch_size = 16, I get 66.8% validation accuracy with a training loss of 1.151 and a validation loss of 1.053 at mid-epoch; but if I use LR 16e-4, I get very similar accuracy (64.2%) with a training loss of 3.142 and a validation loss of 2.735. What does this mean?
Does it mean that the model with the higher training and validation loss has more potential to improve?
Should I scale the learning rate linearly with the batch size when using the AdamW_torch optimizer?
I have considered using tools like Optuna or Ray Tune to find the best hyperparameters. But is it worth it at all when training takes this long, or is it better to explore by hand?
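For reference, this is roughly what I had in mind, continuing from the setup sketch above: the Trainer's built-in Optuna backend, run on a small subsample of the training data so each trial stays short (the subsample size and search ranges are just my guesses):

```python
# Sketch: Trainer.hyperparameter_search with the Optuna backend on a subsample.
# Subsample size and search ranges are illustrative guesses, not recommendations.
def model_init():
    base = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=10)  # placeholder
    return get_peft_model(base, lora_config)

def hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [2, 4, 8, 16]
        ),
    }

small_train = train_ds.shuffle(seed=42).select(range(8000))  # ~10% subsample (guess)

search_trainer = Trainer(
    model_init=model_init,            # hyperparameter_search needs model_init, not a fixed model
    args=training_args,
    train_dataset=small_train,
    eval_dataset=val_ds,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,  # accuracy, so "maximize" has a metric to maximize
)

best_run = search_trainer.hyperparameter_search(
    direction="maximize",
    backend="optuna",
    hp_space=hp_space,
    n_trials=10,
)
print(best_run)
```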
My main question: what would you choose as the parameters for a longer training run? I am not very experienced with tuning hyperparameters of large language models.
I would be grateful to anyone who can help me. Thank you.
Upvotes: 1
Views: 394