Causes of Low Achieved Occupancy

Question

NVIDIA's website lists several causes of low achieved occupancy. One of them is uneven distribution of workload among blocks, which results in blocks holding on to shared memory resources and not releasing them until the whole block has finished. The suggested fix is to decrease the block size, thus increasing the overall number of blocks (given that we keep the total number of threads constant, of course).

A good explanation of this was also given here on Stack Overflow.

Given the aforementioned information, shouldn't the right course of action (in order to maximize performance) be to simply set the block size as small as possible (equal to the size of a warp, i.e. 32 threads)? That is, unless a larger number of threads needs to communicate through shared memory, I assume.

talonmies · Accepted Answer

Given the aforementioned information, shouldn't the right course of action (in order to maximize performance) be to simply set the block size as small as possible (equal to the size of a warp, i.e. 32 threads)?

No.

As shown in the documentation here, there is a limit on the number of resident blocks per multiprocessor. With 32-thread blocks each block is a single warp, so that limit would leave you with a maximum theoretical occupancy of 25% or 50%, depending on what hardware you run the kernel on.
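To make the arithmetic behind those percentages concrete, here is a rough sketch in plain Python (no GPU needed). It only models two per-multiprocessor limits, resident blocks and resident warps, and ignores register and shared memory pressure. The function name `theoretical_occupancy` and the two limit profiles (16 or 32 resident blocks with 64 resident warps) are my own illustrative assumptions, not values queried from a device; look up the real numbers for your compute capability in the documentation.

```python
# Theoretical occupancy for a given block size, considering only two
# per-multiprocessor limits: resident blocks and resident warps.
# Register and shared-memory limits are deliberately ignored here.

WARP_SIZE = 32


def theoretical_occupancy(block_size, max_blocks_per_sm, max_warps_per_sm):
    """Return the fraction of warp slots on one multiprocessor that can be filled."""
    # A block always occupies a whole number of warps.
    warps_per_block = -(-block_size // WARP_SIZE)  # ceiling division
    # Resident blocks are capped by the block limit and by the warp limit.
    blocks_by_warps = max_warps_per_sm // warps_per_block
    resident_blocks = min(max_blocks_per_sm, blocks_by_warps)
    return resident_blocks * warps_per_block / max_warps_per_sm


# Assumed, illustrative limits (hypothetical device profiles, not queried
# from real hardware): (max resident blocks, max resident warps) per SM.
ASSUMED_LIMITS = {
    "16 blocks / 64 warps": (16, 64),
    "32 blocks / 64 warps": (32, 64),
}

if __name__ == "__main__":
    for name, (max_blocks, max_warps) in ASSUMED_LIMITS.items():
        for block_size in (32, 64, 128, 256):
            occ = theoretical_occupancy(block_size, max_blocks, max_warps)
            print(f"{name}: block size {block_size:4d} -> {occ:.0%}")
```

Under these assumed limits, 32-thread blocks run out of block slots long before they run out of warp slots, which is where the 25% and 50% ceilings come from. Larger blocks fill the warp slots with fewer resident blocks.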
