vydesaster

Reputation: 253

CUDA block size and grid size for changing hardware

Imagine I have developed a CUDA kernel and tuned the block size and grid size for optimal performance on my machine. But if I give my application to a customer with a different GPU, they might need different grid and block size settings to get optimal performance. How can I choose the grid size and block size at runtime so that my kernel runs optimally on different GPUs?

Upvotes: 1

Views: 1534

Answers (1)

Robert Crovella

Reputation: 151879

When you change the grid size (at a fixed block size), you are changing the total number of threads. Focusing on total threads, then, the principal target is the maximum in-flight thread-carrying capacity of the GPU you are running on.

A GPU code that seeks to maximize the utilization of the GPU it is running on should attempt to have at least that many threads. Fewer can be bad; more is not likely to make a big difference.

This target is easy to calculate. For most GPUs it is 2048 times the number of SMs in your GPU. (Turing GPUs have reduced the maximum thread load per SM from 2048 to 1024).

You can find out the number of SMs in your GPU at runtime using a call to cudaGetDeviceProperties() (study the deviceQuery sample code).
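As a minimal sketch, the relevant fields of the `cudaDeviceProp` structure are `multiProcessorCount` (the number of SMs) and `maxThreadsPerMultiProcessor`, which reports the per-SM thread capacity directly (2048 on most GPUs, 1024 on Turing), so you don't need to hard-code either number:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0
    // multiProcessorCount: number of SMs on this GPU
    // maxThreadsPerMultiProcessor: 2048 on most GPUs, 1024 on Turing
    printf("SMs: %d, max threads per SM: %d\n",
           prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor);
    return 0;
}
```

Compile with `nvcc`; this requires a CUDA-capable GPU and driver to run.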

Once you know the number of SMs, multiply it by 2048. That is the number of threads to launch in your grid. At this level of tuning/approximation, there should be no need to change the tuned number of threads per block.
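Putting that together, here is a hedged sketch of a launch sized this way. The kernel (`scale`, a hypothetical example) uses a grid-stride loop, so it produces correct results for any grid size, including one chosen to saturate the GPU rather than one tied to the problem size:

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel written with a grid-stride loop, so any
// grid size covers all n elements correctly.
__global__ void scale(float *data, int n, float factor) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        data[i] *= factor;
}

void launch(float *d_data, int n, float factor) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int blockSize = 256;  // keep the block size you already tuned
    // Target: enough threads to load every SM to its maximum
    // (maxThreadsPerMultiProcessor handles the Turing 1024 case)
    int target = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
    int gridSize = (target + blockSize - 1) / blockSize;
    scale<<<gridSize, blockSize>>>(d_data, n, factor);
}
```

The grid-stride loop is what decouples the grid size from the data size, which is the property that lets you pick the grid size from the hardware instead.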

It's true that your specific code may not be able to actually achieve 2048 threads on each SM (this is related to discussions of occupancy). However, for a simplistic target, this won't hurt anything. If you already know the actual occupancy capability of your code, or have used the occupancy API to determine it, then you can scale down your target from 2048 threads per SM to some lower number. But this scaling down probably won't improve the performance of your code by much, if at all.
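If you do want the occupancy-adjusted target, `cudaOccupancyMaxActiveBlocksPerMultiprocessor` reports how many blocks of your kernel can actually be resident on one SM. A sketch, again using a hypothetical grid-stride kernel (`myKernel`) and an assumed tuned block size of 256:

```cpp
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n) {  // hypothetical kernel
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        data[i] += 1.0f;
}

void launchWithOccupancy(float *d_data, int n) {
    int blockSize = 256;  // your tuned block size
    int blocksPerSM = 0;
    // Ask the runtime how many blocks of this kernel fit on one SM
    // (last argument is dynamic shared memory per block, here none)
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  blockSize, 0);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Grid sized to the kernel's achievable occupancy,
    // rather than the 2048-threads-per-SM ceiling
    int gridSize = blocksPerSM * prop.multiProcessorCount;
    myKernel<<<gridSize, blockSize>>>(d_data, n);
}
```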

Upvotes: 4
