Reputation: 13
My GPU is of compute capability 2.1, with 2 SMs, and each SM has 48 cores. According to the Technical Specifications in the CUDA C Programming Guide, the maximum number of blocks in a grid is 65535, and the maximum number of resident blocks per multiprocessor is 8.
I am confused about how many blocks I can launch. If the maximum number of blocks per SM is 8, doesn't that mean I could launch at most 16 blocks when there are only 2 SMs? But I successfully launched many more blocks than that.
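For example, something like this (a minimal sketch of the kind of launch I mean; the kernel is just a placeholder) runs fine even though the grid has far more than 16 blocks:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel: each thread writes its global index.
__global__ void fill(int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i;
}

int main()
{
    const int blocks = 1024;          // far more than 8 blocks/SM * 2 SMs
    const int threadsPerBlock = 128;
    int *d_out;
    cudaMalloc(&d_out, blocks * threadsPerBlock * sizeof(int));
    fill<<<blocks, threadsPerBlock>>>(d_out);   // launches without error
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```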
Maybe there are such things as active blocks and inactive blocks? If that is the case, how are these blocks scheduled? Do the inactive ones wait until all 8 active blocks are finished? But this brings up synchronization problems...
Some more questions... if there are 48 cores on each SM, then there can be 3 half-warps executing at the same time. But the shared memory has only 32 banks. If two threads try to read from the same bank concurrently, won't they produce a bank conflict even if they belong to different half-warps?
Upvotes: 1
Views: 3365
Reputation: 963
I'm definitely late to the party, but since the previous answer was not accepted, I'm providing one in the hope of helping other users with the same question.
The maximum number of blocks that an SM can contain refers to the maximum number of active blocks at a given time. Blocks can be organized into one- or two-dimensional grids of up to 65,535 blocks in each dimension, but the SMs of your GPU will only be able to accommodate a certain number of blocks at once. This limit is linked in two ways to the compute capability of your GPU.
Each GPU enforces a maximum number of blocks per SM, regardless of the number of threads each block contains and the amount of resources it uses. For example, a GPU with compute capability 2.0 has a limit of 8 blocks/SM, while one with compute capability 7.0 has a limit of 32 blocks/SM. This is the best number of active blocks per SM that you can achieve: let's call it MAX_BLOCKS.
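For what it's worth, on newer toolkits (CUDA 11.0 and later, so not applicable to a compute capability 2.1 card) you can query this limit at runtime instead of looking it up in a table. A minimal sketch:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    // maxBlocksPerMultiProcessor is available from CUDA 11.0 on
    printf("Max resident blocks per SM: %d\n", prop.maxBlocksPerMultiProcessor);
    printf("Number of SMs:              %d\n", prop.multiProcessorCount);
    return 0;
}
```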
A block is made up of threads, and each thread uses a certain number of registers: the more registers it uses, the greater the amount of resources consumed by the block that contains it. Similarly, the amount of shared memory assigned to a block increases the resources the block needs in order to be allocated. Once a certain value is exceeded, the resources needed per block become so large that the SM will not be able to allocate as many blocks as MAX_BLOCKS allows: this means that the amount of resources needed by each block is what limits the maximum number of active blocks per SM.
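A quick worked example (the numbers come from the compute capability 2.x specs and are purely illustrative): an SM of compute capability 2.x has 32768 registers. A block of 256 threads in which each thread uses 32 registers needs 256 × 32 = 8192 registers, so the register file alone caps the SM at 32768 / 8192 = 4 resident blocks, below the MAX_BLOCKS of 8.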
How do I find these boundaries?
NVIDIA thought about that too. The CUDA Occupancy Calculator spreadsheet is available on their site, with which you can discover the hardware limits grouped by compute capability. You can also enter the amount of resources used by your blocks (number of threads, registers per thread, bytes of shared memory) and get graphs and important information about the number of active blocks.
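Besides the spreadsheet, the runtime also exposes an occupancy API (cudaOccupancyMaxActiveBlocksPerMultiprocessor, available since CUDA 6.5) that reports the same number programmatically. A minimal sketch, assuming a placeholder kernel myKernel:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// A placeholder kernel; substitute your own.
__global__ void myKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main()
{
    int numBlocks = 0;
    const int blockSize = 256;       // threads per block
    const size_t dynamicSmem = 0;    // dynamic shared memory per block

    // Asks the runtime how many blocks of myKernel, at this block size
    // and shared-memory usage, can be resident on one SM.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, myKernel,
                                                  blockSize, dynamicSmem);
    printf("Active blocks per SM: %d\n", numBlocks);
    return 0;
}
```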
Upvotes: 2
Reputation: 12099
According to the Technical Specifications in the CUDA C Programming Guide, the maximum number of blocks in a grid is 65535, and the maximum number of resident blocks per multiprocessor is 8.
I am confused about how many blocks I can launch. If the maximum number of blocks per SM is 8, doesn't that mean I could launch at most 16 blocks when there are only 2 SMs?
The maximum number of blocks (per dimension in a grid) is a limitation on what the CUDA scheduler can handle. Except for the recent Kepler GPUs, the limitation is 65535 along each dimension.
Practically, the number of active blocks depends on a lot of things. There is a hard limit on the number of blocks each SM can hold, but that number can be smaller if you use large amounts of shared memory, registers, or threads per block.
The scheduler switches out inactive blocks (i.e. blocks that are stalled for various reasons) and switches in active ones. Many more blocks than can be physically resident are launched so as to keep the SMs as busy as possible.
But this brings up synchronization problems...
Never assume CUDA blocks execute in order. They can be processed out of order, and the only synchronization point is the end of the kernel, i.e. cudaDeviceSynchronize() on the host.
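One way to convince yourself of this (a small experiment sketch, not anything from the docs): have each block record the order in which it started, then inspect the result on the host.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__device__ int counter = 0;

// Each block's first thread grabs a ticket; the ticket values reflect
// the order in which blocks actually started, not their blockIdx order.
__global__ void recordOrder(int *order)
{
    if (threadIdx.x == 0)
        order[blockIdx.x] = atomicAdd(&counter, 1);
}

int main()
{
    const int blocks = 64;
    int *d_order, h_order[blocks];
    cudaMalloc(&d_order, blocks * sizeof(int));
    recordOrder<<<blocks, 32>>>(d_order);
    cudaDeviceSynchronize();  // the only host-side synchronization point
    cudaMemcpy(h_order, d_order, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    for (int b = 0; b < blocks; ++b)
        printf("block %2d started %2d-th\n", b, h_order[b]);
    cudaFree(d_order);
    return 0;
}
```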
Upvotes: 1