What is Sharding in the FSDP, and how is FSDP different from Pipeline Parallel?

Hugging Face describes FSDP as:

sharding the model parameters, gradients, and optimizer states across data parallel processes and it can also offload sharded model parameters to a CPU.

And pipeline parallelism as:

split up vertically (layer-level) across multiple GPUs, so that only one or several layers of the model are placed on a single GPU. Each GPU processes in parallel different stages of the pipeline...

How are these different? If I have a simple 2-layer model, what does FSDP do that pipeline parallelism does not? I've read descriptions that say something like "in PP the model is split across GPUs, while in FSDP the model is sharded." Doesn't "sharded" just mean split into parts? TIA!

Upvotes: 1

Answers (2)

basujindal

Reputation: 1

FSDP is orthogonal to pipeline parallelism, so the two can be used together. The following paragraph from the FSDP paper explains how:

Pipeline parallel can be functionally integrated with FSDP by employing FSDP to wrap each individual pipeline stage. However, as pipeline parallel divides input mini-batches into smaller microbatches, the default full sharding strategy in FSDP would have to unshard model parameters for every micro-batch. Consequently, combining these approaches with default FSDP configurations may lead to significant communication overhead. Fortunately, FSDP offers alternative sharding strategies that can keep parameters unsharded after the forward pass, avoiding unnecessary AllGather communications per micro-batch. Admittedly, this requires storing parameters of an entire pipeline stage on the GPU device, but FSDP can still reduce memory usage as it still shards gradients and optimizer states.

FSDP groups the model's layers into units of N layers each (N depends on the FSDP wrapping configuration). The weights of every layer in a unit are then sharded, that is, split into parts with one part stored on each GPU.
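One way to picture the sharding step above is the toy sketch below. It is not the PyTorch implementation: plain Python lists stand in for GPU tensors, and the helper names `shard_flat` and `all_gather` are made up for illustration. It assumes a unit's weights are flattened into one list, padded, and cut into equal chunks, one per GPU.

```python
# Toy illustration only: plain Python lists stand in for GPU tensors.

def shard_flat(params, world_size):
    """Split a flat parameter list into equal chunks, zero-padding the tail."""
    chunk = -(-len(params) // world_size)  # ceiling division
    padded = params + [0.0] * (chunk * world_size - len(params))
    return [padded[r * chunk:(r + 1) * chunk] for r in range(world_size)]

def all_gather(shards, numel):
    """Rebuild the full flat parameter from every rank's shard, dropping padding."""
    flat = [x for shard in shards for x in shard]
    return flat[:numel]

# One hypothetical unit holding two small layers, flattened together.
layer_a = [1.0, 2.0, 3.0]
layer_b = [4.0, 5.0]
unit = layer_a + layer_b

shards = shard_flat(unit, world_size=2)    # what each of 2 GPUs stores at rest
restored = all_gather(shards, len(unit))   # what a GPU sees while computing
print(shards)
print(restored)
```

Each GPU stores only its own chunk most of the time; the full list exists only briefly, after the gather.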

Upvotes: 0

J369

Reputation: 477

In FSDP, each GPU gets a part of every layer; in pipeline parallelism, each GPU gets its own whole layers.

Say you have 2 GPUs.

For FSDP, layer one gets sharded and layer two gets sharded, so each GPU holds half of each layer. Each GPU also has its own independent batch. At the first step, neither GPU can compute layer one, since each has only half of it. So they send their portions of layer one to each other (an all-gather) and then each computes its own output of layer one. Note that we are still saving memory, since layer two remains split up. Then they reshard layer one to free memory and send their portions of layer two to each other. The main idea behind FSDP is that a GPU only holds the entirety of a layer while it is using that layer. Otherwise, that layer is sharded.
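The two-GPU walk-through above can be sketched as a small simulation. This is a toy under stated assumptions, not real FSDP code: the two "GPUs" are just list indices, no actual communication happens, and `shard`, `all_gather`, `linear`, and `fsdp_forward` are hypothetical helpers. Each layer is a 2x2 weight matrix stored as a flat list.

```python
# Toy simulation: two "GPUs" are list indices; nothing is actually sent.

WORLD_SIZE = 2

def shard(flat_w):
    """Cut a flat weight list into WORLD_SIZE equal pieces, one per GPU."""
    size = len(flat_w) // WORLD_SIZE
    return [flat_w[r * size:(r + 1) * size] for r in range(WORLD_SIZE)]

def all_gather(shards):
    """Every GPU receives every shard and rebuilds the full layer."""
    return [w for s in shards for w in s]

def linear(flat_w, x):
    """y = W x, with W stored row-major as a flat list."""
    n = len(x)
    return [sum(flat_w[i * n + j] * x[j] for j in range(n)) for i in range(n)]

def fsdp_forward(sharded_layers, batches):
    activations = list(batches)            # one independent batch per GPU
    for layer_shards in sharded_layers:
        full = all_gather(layer_shards)    # unshard only the layer in use
        activations = [linear(full, x) for x in activations]
        del full                           # reshard: drop the full copy again
    return activations

layer_one = [1.0, 2.0, 3.0, 4.0]
layer_two = [0.0, 1.0, 1.0, 0.0]
sharded_layers = [shard(layer_one), shard(layer_two)]

batches = [[1.0, 0.0], [0.0, 1.0]]         # GPU 0's batch, GPU 1's batch
outputs = fsdp_forward(sharded_layers, batches)
print(sharded_layers[0])                   # each GPU stores half of layer one
print(outputs)
```

Both GPUs run both layers on their own data; what is split is the storage of the weights, not the work of the layers.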

In pipeline parallelism, on the other hand, instead of splitting each layer and communicating the pieces as necessary, one GPU will always have layer one and the other GPU will always have layer two. The first GPU will send its output of layer one to the second GPU.
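For contrast, here is a comparable toy sketch of the pipeline case. Again this is an illustration with made-up names (`linear`, `pipeline_forward`), not a real pipeline library: each stage permanently owns one whole layer, and only activations move between stages. It assumes the batch is cut into micro-batches and that stage s works on micro-batch t - s at clock tick t.

```python
# Toy simulation: stages are list indices; "sending" is writing to an inbox.

def linear(flat_w, x):
    """y = W x, with W stored row-major as a flat list."""
    n = len(x)
    return [sum(flat_w[i * n + j] * x[j] for j in range(n)) for i in range(n)]

def pipeline_forward(stage_weights, micro_batches):
    n_stages, n_micro = len(stage_weights), len(micro_batches)
    # inbox[s][m] is the input of micro-batch m waiting at stage s.
    inbox = [dict() for _ in range(n_stages + 1)]
    for m, x in enumerate(micro_batches):
        inbox[0][m] = x
    schedule = []                           # which (stage, micro-batch) ran per tick
    for tick in range(n_stages + n_micro - 1):
        busy = []
        for s in range(n_stages):
            m = tick - s
            if 0 <= m < n_micro:
                # Stage s uses only its own layer, then passes the activation on.
                inbox[s + 1][m] = linear(stage_weights[s], inbox[s][m])
                busy.append((s, m))
        schedule.append(busy)
    return [inbox[n_stages][m] for m in range(n_micro)], schedule

# Stage 0 always holds layer one; stage 1 always holds layer two.
stage_weights = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 1.0, 0.0]]
micro_batches = [[1.0, 0.0], [0.0, 1.0]]

outputs, schedule = pipeline_forward(stage_weights, micro_batches)
print(outputs)
print(schedule)
```

The weights never move here. In the middle tick both stages are busy at once, each on a different micro-batch, which is where the parallelism comes from.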

Upvotes: 4
