Reputation: 101
Assume that we run a kernel function with 4 blocks {b1, b2, b3, b4}. The blocks require {10, 2, 3, 4} units of time, respectively, to complete their jobs, and our GPU can process only 2 blocks in parallel.
In that case, which one is the correct description of how our GPU works?
Upvotes: 0
Views: 269
Reputation: 50836
To quote this document from Nvidia:
Threadblocks are assigned to SMs
- Assignment happens only if an SM has sufficient resources for the entire threadblock
- Resources: registers, SMEM, warp slots
- Threadblocks that haven’t been assigned wait for resources to free up
- The order in which threadblocks are assigned is not defined
- Can and does vary between architectures
Thus, without more information, both schedules are theoretically possible. In practice, it is even more complex, since a GPU has many SMs and, AFAIK, each SM can now execute multiple blocks concurrently.
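To make the effect of the undefined assignment order concrete, here is a small toy simulation using the numbers from the question. It is only a sketch under strong assumptions: the GPU is modelled as 2 identical slots, a waiting block starts as soon as any slot frees up, and the durations are fixed. The function `simulate` and its parameters are hypothetical names made up for this illustration; real hardware does not expose its scheduler this way.

```python
import heapq

def simulate(durations, order, slots=2):
    """Toy greedy model: the next block in `order` starts on whichever
    slot becomes free first. Returns (schedule, makespan), where
    schedule maps block name -> (start, end)."""
    # Each heap entry is (time at which the slot becomes free, slot index).
    free_at = [(0, s) for s in range(slots)]
    heapq.heapify(free_at)
    schedule = {}
    for name in order:
        start, slot = heapq.heappop(free_at)   # earliest available slot
        end = start + durations[name]
        schedule[name] = (start, end)
        heapq.heappush(free_at, (end, slot))   # slot is busy until `end`
    makespan = max(end for _, end in schedule.values())
    return schedule, makespan

# Durations taken from the question.
durations = {"b1": 10, "b2": 2, "b3": 3, "b4": 4}

# Two assignment orders; the hardware is free to pick either (or another).
in_order, t_in_order = simulate(durations, ["b1", "b2", "b3", "b4"])
reordered, t_reordered = simulate(durations, ["b2", "b3", "b4", "b1"])

print(in_order, t_in_order)
print(reordered, t_reordered)
```

In this model, assigning the blocks in index order finishes at time 10 (b2, b3 and b4 run back to back next to b1), while an order that starts b1 last finishes at time 13. Since the assignment order is not defined, a kernel should not rely on either outcome.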
Upvotes: 1