MutomboDikey

Reputation: 175

Causes of Low Achieved Occupancy

The NVIDIA website mentions a few causes of low achieved occupancy, among them an uneven distribution of workload among blocks, which results in blocks hoarding shared memory and not releasing it until the block has finished. The suggestion is to decrease the block size, thus increasing the overall number of blocks (given that we keep the total number of threads constant, of course).

A good explanation of this was also given here on Stack Overflow.

Given the aforementioned information, shouldn't the right course of action (in order to maximize performance) simply be to set the block size as small as possible (equal to the size of a warp, say 32 threads)? That is, unless a larger number of threads needs to communicate through shared memory, I assume.

Upvotes: 0

Views: 1802

Answers (2)

Krzysztof

Reputation: 779

Usually it is a good approach to use blocks that are as small as possible but still big enough to saturate the device (64 or 128 threads per block, depending on the device) - this is not always possible, since you might need to synchronize threads or communicate via shared memory.
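To show where figures like 64 or 128 threads per block can come from, here is a minimal host-side C++ sketch. The function name is hypothetical, and the limits used (2048 resident threads per SM, with 32 or 16 resident blocks per SM) are assumed example values; check them against the limits for your own device. It only considers the block-count limit and ignores register and shared memory limits.

```cpp
// Smallest block size (rounded up to a whole number of warps) at which the
// per-SM block limit no longer caps the number of resident threads.
// Hypothetical helper; register and shared memory limits are ignored.
constexpr int min_block_size_for_full_occupancy(int max_threads_per_sm,
                                                int max_blocks_per_sm,
                                                int warp_size = 32) {
    // Threads each block must contribute so that max_blocks_per_sm blocks
    // fill the SM (ceiling division).
    const int threads =
        (max_threads_per_sm + max_blocks_per_sm - 1) / max_blocks_per_sm;
    // Round up to a multiple of the warp size.
    return ((threads + warp_size - 1) / warp_size) * warp_size;
}

// Assumed limits: 2048 threads per SM, 32 or 16 blocks per SM.
static_assert(min_block_size_for_full_occupancy(2048, 32) == 64,
              "32 blocks per SM: 64 threads per block");
static_assert(min_block_size_for_full_occupancy(2048, 16) == 128,
              "16 blocks per SM: 128 threads per block");
```

Under these assumptions, a device that allows 32 resident blocks per SM is saturated from 64 threads per block, while one that allows only 16 needs 128.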

Having a large number of small blocks lets the GPU do a kind of "autobalancing" and keep all SMs busy.

The same applies to a CPU - if you have 5 independent tasks that each take 4 seconds, but only 4 cores, the job will finish after 8 seconds (during the first 4 seconds 4 cores work on the first 4 tasks, and then 1 core runs the last task while 3 cores idle). If you are able to divide the whole job into 20 tasks that take 1 second each, the whole job will be done in 5 seconds. So having a lot of small tasks helps to utilize the hardware.
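The arithmetic in this CPU example can be checked with a small C++ sketch. The function name is hypothetical, and it assumes equal-length tasks that are never split across cores, so the work proceeds in whole rounds.

```cpp
#include <cstdint>

// Wall-clock time for `tasks` independent, equal-length tasks on `cores`
// identical cores, assuming a task cannot be split across cores.
// The tasks run in ceil(tasks / cores) rounds.
constexpr std::uint64_t makespan_seconds(std::uint64_t tasks,
                                         std::uint64_t cores,
                                         std::uint64_t seconds_per_task) {
    const std::uint64_t rounds = (tasks + cores - 1) / cores;  // ceiling division
    return rounds * seconds_per_task;
}

// The two cases from the example: same total work, different granularity.
static_assert(makespan_seconds(5, 4, 4) == 8, "5 tasks x 4 s on 4 cores");
static_assert(makespan_seconds(20, 4, 1) == 5, "20 tasks x 1 s on 4 cores");
```

Both cases do 20 core-seconds of work; the finer split just leaves less idle time in the last round.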

In the case of a GPU you can have a large number of active blocks (on a Titan X it is 24 SMs x 32 active blocks = 768 blocks), and it would be good to use this power. That said, you do not always need to fully saturate the device. On many tasks I see that using 32 threads per block (so at most 50% occupancy) gives the same performance as using 64 threads per block. In the end it is all a matter of running some benchmarks and choosing whatever is best for your case on your hardware.

Upvotes: 0

talonmies

Reputation: 72349

Given the aforementioned information, shouldn't the right course of action (in order to maximize performance) simply be to set the block size as small as possible (equal to the size of a warp, say 32 threads)?

No.

As shown in the documentation here, there is a limit on the number of blocks per multiprocessor, which would leave you with a maximum theoretical occupancy of 25% or 50% when using 32-thread blocks, depending on what hardware you run the kernel on.
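As a rough illustration of how the 25% and 50% figures can arise, here is a host-side C++ sketch. The function name is hypothetical, and the limits (2048 resident threads per SM, with 16 or 32 resident blocks per SM) are assumed example values rather than numbers taken from this answer. Register and shared memory limits are ignored, so real occupancy can be lower.

```cpp
// Theoretical occupancy of one SM: resident warps / maximum resident warps,
// limited only by the block-count and thread-count caps.
// Hypothetical helper; register and shared memory limits are ignored.
constexpr double theoretical_occupancy(int block_size,
                                       int max_blocks_per_sm,
                                       int max_threads_per_sm,
                                       int warp_size = 32) {
    const int warps_per_block = (block_size + warp_size - 1) / warp_size;
    const int max_warps = max_threads_per_sm / warp_size;
    // Blocks that fit under the thread (warp) limit alone.
    const int blocks_by_warps = max_warps / warps_per_block;
    // The block-count limit may cap this further.
    const int resident_blocks =
        blocks_by_warps < max_blocks_per_sm ? blocks_by_warps : max_blocks_per_sm;
    return static_cast<double>(resident_blocks * warps_per_block) / max_warps;
}

// 32-thread blocks under the assumed limits.
static_assert(theoretical_occupancy(32, 16, 2048) == 0.25, "16 blocks per SM");
static_assert(theoretical_occupancy(32, 32, 2048) == 0.5, "32 blocks per SM");
// Doubling the block size lifts the cap in the 32-block case.
static_assert(theoretical_occupancy(64, 32, 2048) == 1.0, "64-thread blocks");
```

With one warp per block, the block-count limit is reached long before the thread limit, which is why a warp-sized block cannot fill the multiprocessor under these assumptions.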

Upvotes: 2
