Reputation: 91
I am testing how micro-batch-size influences throughput per GPU at a constant global-batch-size in Megatron-LM.
I have run several tests with a 400M-parameter transformer model on 2 A40 GPUs, using only data parallelism. Here are the training arguments:
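Roughly, the launch looks like this (a minimal sketch, not my exact script: the batch, sequence, and iteration values are the ones stated in this question, while the model-shape flags are assumed approximations for a ~400M model, and the data/tokenizer flags are omitted):

```bash
# Sketch of an equivalent launch; model-shape flags (num-layers,
# hidden-size, num-attention-heads) are assumed values for ~400M params.
# Data parallelism only: 2 GPUs, no tensor/pipeline parallel flags.
torchrun --nproc_per_node=2 pretrain_gpt.py \
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --seq-length 1024 \
    --max-position-embeddings 1024 \
    --micro-batch-size 4 \
    --global-batch-size 24 \
    --train-iters 100 \
    --log-interval 5
# micro-batch-size is the only flag changed between tests; with DP=2,
# Megatron-LM accumulates gradients so each step still processes 24 samples.
```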
In each test I change only the micro-batch-size, training for 100 iterations with seq_len = 1024 and global-batch-size = 24. Here are the results for different micro-batch-sizes:
I log every 5 iterations using Megatron-LM's built-in logging and average the throughput per GPU over all iterations (see the formula below for exactly what I compute). The total computational work per iteration is identical across tests, yet throughput per GPU increases as the micro-batch-size increases. I suspect this is related to GPU cache behavior or arithmetic intensity, but I am not clear on the details. Can anyone provide an in-depth explanation?
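To be precise about the metric (a sketch of my bookkeeping, assuming "throughput" means tokens processed per second per GPU, with the per-iteration time taken from the log):

$$\text{throughput per GPU} = \frac{\text{global-batch-size} \times \text{seq\_len}}{t_{\text{iter}} \times \text{num GPUs}} = \frac{24 \times 1024}{2\,t_{\text{iter}}} \ \text{tokens/s}$$

Since the numerator is constant across all my tests, a higher throughput per GPU is equivalent to a lower iteration time; only $t_{\text{iter}}$ changes with the micro-batch-size.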
Upvotes: 0
Views: 87