ZHANG Juenjie

Reputation: 531

What does it mean to say a GPU is under-utilized due to low occupancy?

I am using Numba and CuPy to do GPU coding. I have now switched my code from an NVIDIA V100 card to an A100, and I get the following warnings:

  1. NumbaPerformanceWarning: Grid size (27) < 2 * SM count (216) will likely result in GPU under utilization due to low occupancy.

  2. NumbaPerformanceWarning: Host array used in CUDA kernel will incur copy overhead to/from device.

Does anyone know what these two warnings really suggest? How should I improve my code?

Upvotes: 3

Views: 7589

Answers (2)

Whatever

Reputation: 11

About the first message: if you still want to run it with a small grid size, just set the environment variable NUMBA_CUDA_LOW_OCCUPANCY_WARNINGS to 0 (here, for a conda environment):

$ conda env config vars set NUMBA_CUDA_LOW_OCCUPANCY_WARNINGS=0

To see more details, check Numba - Reference Manual - Environment Variables
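If you would rather not touch the conda environment, setting the variable from Python before numba is imported should have the same effect (a sketch of my own, not from the reference manual):

import os
os.environ["NUMBA_CUDA_LOW_OCCUPANCY_WARNINGS"] = "0"  # must be set before numba reads its configuration

from numba import cuda  # the warning is now suppressed for this process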

Upvotes: 1

Robert Crovella

Reputation: 151944

NumbaPerformanceWarning: Grid size (27) < 2 * SM count (216) will likely result in GPU under utilization due to low occupancy.

A GPU is subdivided into SMs. Each SM can hold a complement of threadblocks (which is like saying it can hold a complement of threads). In order to "fully utilize" the GPU, you would want each SM to be "full", which roughly means each SM has enough threadblocks to fill its complement of threads. An A100 GPU has 108 SMs. If your kernel launch (i.e. the grid) has fewer than 108 threadblocks, then your kernel will not be able to fully utilize the GPU. Some SMs will be empty. A threadblock cannot be resident on 2 or more SMs at the same time. Even 108 threadblocks (one per SM) may not be enough. An A100 SM can hold 2048 threads, which is two threadblocks of 1024 threads each. Anything less than 2*108 threadblocks in your kernel launch may not fully utilize the GPU. When you don't fully utilize the GPU, your performance may not be as good as possible.

The solution is to expose enough parallelism (enough threads) in your kernel launch to fully "occupy" or "utilize" the GPU. 216 threadblocks of 1024 threads each is sufficient for an A100. Anything less may not be.
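To make that concrete, here is a rough sketch of my own (not code from the question): it queries the SM count through numba and sizes the grid from the amount of data, so that with a large enough input the grid comfortably exceeds 2 * SM count. The kernel, the array size, and the MULTIPROCESSOR_COUNT device attribute are assumptions for the illustration:

import numpy as np
from numba import cuda

@cuda.jit
def scale(arr, factor):              # hypothetical kernel, just for illustration
    i = cuda.grid(1)
    if i < arr.shape[0]:
        arr[i] *= factor

dev = cuda.get_current_device()
sm_count = dev.MULTIPROCESSOR_COUNT  # 108 on an A100

a = np.ones(1 << 22, dtype=np.float32)
threads = 256
blocks = (a.size + threads - 1) // threads  # enough blocks to cover the data

print(blocks, ">=", 2 * sm_count)    # with a large enough input, the warning goes away

d_a = cuda.to_device(a)
scale[blocks, threads](d_a, 2.0)
a = d_a.copy_to_host()

The point is that the grid size comes from the amount of work you expose: a 27-block launch simply does not give the A100 enough threadblocks to keep its 108 SMs busy.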

For additional understanding here, I recommend the first 4 sections of this course.

NumbaPerformanceWarning: Host array used in CUDA kernel will incur copy overhead to/from device.

One of the cool things about a numba kernel launch is that I can pass to it a host data array:

a = numpy.ones(32, dtype=numpy.int64)
my_kernel[blocks, threads](a)

and numba will "do the right thing". In the above example it will:

  1. create a device array in device memory to hold the contents of a; let's call it d_a
  2. copy the data from a to d_a (Host->Device)
  3. launch your kernel, where the kernel is actually using d_a
  4. when the kernel is finished, copy the contents of d_a back to a (Device->Host)

That's all very convenient. But what if I were doing something like this:

a = numpy.ones(32, dtype=numpy.int64)
my_kernel1[blocks, threads](a)
my_kernel2[blocks, threads](a)

Numba will perform steps 1-4 above for the launch of my_kernel1, and then perform steps 1-4 again for the launch of my_kernel2. In most cases this is probably not what you want as a numba cuda programmer.

The solution in this case is to "take control" of data movement:

a = numpy.ones(32, dtype=numpy.int64)
d_a = numba.cuda.to_device(a)        # one Host->Device copy
my_kernel1[blocks, threads](d_a)
my_kernel2[blocks, threads](d_a)
a = d_a.copy_to_host()               # one Device->Host copy

This eliminates unnecessary copying and will generally make your program run faster. (For trivial examples involving a single kernel launch, there will probably be no difference.)
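If you want to see the effect yourself, a small timing sketch along these lines (my own illustration; the kernel, loop count, and sizes are arbitrary) contrasts the two approaches:

import time
import numpy as np
from numba import cuda

@cuda.jit
def add_one(arr):                    # hypothetical kernel, just for illustration
    i = cuda.grid(1)
    if i < arr.shape[0]:
        arr[i] += 1

a = np.zeros(10_000_000, dtype=np.float32)
threads = 256
blocks = (a.size + threads - 1) // threads

t0 = time.time()
for _ in range(10):
    add_one[blocks, threads](a)      # implicit copy to and from the device on every launch
cuda.synchronize()
print("host array:   ", time.time() - t0)

d_a = cuda.to_device(a)              # one explicit Host->Device copy
t0 = time.time()
for _ in range(10):
    add_one[blocks, threads](d_a)    # data stays on the device between launches
a = d_a.copy_to_host()               # one explicit Device->Host copy
print("device array: ", time.time() - t0)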

For additional understanding, probably any online tutorial such as this one, or just the numba cuda docs, will be useful.

Upvotes: 12
