Isbister

Reputation: 946

How can I lock the global batch size? (DeepSpeed)

For context, I am doing a full finetune of an LLM (meta-llama/Llama-3.1-8B) on an HPC cluster with A100 (40 GB) GPUs, on a rather large corpus of text.

The training setup uses the SFTTrainer from Hugging Face TRL (on top of Transformers).

Distributed training is handled with Accelerate + DeepSpeed ZeRO-2.

The "issue" I am facing is that when I increase the number of nodes is my SLURM config, the global batch sizes increases, since, it seems to be a function of the number of total gpus.

Currently I have this config:

per_device_train_batch_size = 1
gradient_accumulation_steps = 8

and a single node has 4x A100 GPUs.

So, e.g., using 16 nodes I get: global batch size = per_device_train_batch_size * gradient_accumulation_steps * nodes * gpus_per_node = 1 * 8 * 16 * 4 = 512.

If I launch the same training with 32 nodes, the global batch size becomes 1024, which means I get half the number of gradient updates. This hurts convergence, since the training finishes in half as many optimizer steps with the larger batch size (about 1M tokens per batch).

I can of course lower gradient_accumulation_steps by hand, but ideally I would like to lock the global batch size once, in case I launch a training on hundreds of nodes. (A rough sketch of that manual workaround is below.)
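For illustration, the launch-time arithmetic I would otherwise have to redo for every node count looks roughly like this (TARGET_GLOBAL_BS and PER_DEVICE_BS are placeholder names of mine, untested):

# Rough sketch of the manual workaround: shrink gradient accumulation
# as GPUs are added so the global batch size stays at the target value.
TARGET_GLOBAL_BS=512
PER_DEVICE_BS=1
WORLD_SIZE=$((SLURM_GPUS_ON_NODE * SLURM_NNODES))
GRAD_ACC_STEPS=$((TARGET_GLOBAL_BS / (PER_DEVICE_BS * WORLD_SIZE)))
# 16 nodes: 512 / (1 * 64) = 8    32 nodes: 512 / (1 * 128) = 4

The computed value would then have to be threaded into the trainer config instead of the hard-coded 8, which is exactly the bookkeeping I would like to avoid.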

trainer.yaml:

learning_rate: 2e-5
warmup_steps: 100
lr_scheduler: cosine
optimizer: adamw_torch_fused
max_grad_norm: 1.0
gradient_accumulation_steps: 8
per_device_train_batch_size: 1
num_epochs: 1
sequence_len: 8192

deepspeed_zero2.json:

{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu"
    },
    "contiguous_gradients": true,
    "overlap_comm": true
  },
  "bf16": {
    "enabled": "auto"
  },
  "fp16": {
    "enabled": "auto",
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

Relevant parts of the SLURM script:

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=test
#SBATCH --partition=standard-g
#SBATCH --cpus-per-task=56
#SBATCH --nodes=16
#SBATCH --gpus-per-node=4
#SBATCH --ntasks-per-node=1
#SBATCH --mem=480G
#SBATCH --exclusive
#SBATCH -t 48:00:00

# Variables for the distributed environment
export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
export LOCAL_RANK=$SLURM_LOCALID
export WORLD_SIZE=$((SLURM_GPUS_ON_NODE*SLURM_NNODES))

accelerate launch \
    --rdzv_conf "rdzv_backend=c10d,rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT" \
    --config_file $ACCELERATE_CONFIG_FILE \
    --num_machines $SLURM_NNODES \
    --num_processes $WORLD_SIZE \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --machine_rank \$SLURM_PROCID \
    --role \$(hostname -s) \
    --tee 3 \

Is it possible to lock the global batch size? I read in the DeepSpeed documentation that:

train_batch_size must be equal to train_micro_batch_size_per_gpu * gradient_accumulation_steps * number of GPUs. For simplicity, you can choose to only specify two of the three parameters, the last one will be inferred automatically by DeepSpeed.

Does it make sense to set only train_micro_batch_size_per_gpu and train_batch_size, and leave gradient_accumulation_steps unset? Something like the snippet below.
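I.e., roughly this in deepspeed_zero2.json (untested; 512 is just my current target, the rest of the file unchanged):

{
  "train_micro_batch_size_per_gpu": 1,
  "train_batch_size": 512
}

with the "gradient_accumulation_steps" entry removed, so that DeepSpeed infers it from the other two values as the node count changes.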

Upvotes: 0

Views: 68

Answers (0)
