Serge Rogatch

Reputation: 15090

Low GPU usage in CUDA

I implemented a program which uses different CUDA streams from different CPU threads. Memory copying is done via cudaMemcpyAsync using those streams, and kernel launches also use those streams. The program does double-precision computations (and I suspect this is the culprit; however, cuBLAS reaches 75-85% GPU usage when multiplying matrices of doubles). There are also reduction operations, but they are implemented via if(threadIdx.x < s) with s halved in each iteration, so the stalled warps should leave the SM free for warps from other blocks. The application is GPU- and CPU-intensive: it starts the next piece of work as soon as the previous one has finished, so I expect it to reach 100% of either CPU or GPU.
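
For clarity, the reduction follows roughly this pattern (a minimal sketch with an illustrative kernel name and a fixed block size, not the exact code from my program):

__global__ void ReduceBlock(const double* in, double* out, int n)
{
    __shared__ double sdata[256];          // assumes a block size of 256
    const int tid = threadIdx.x;
    const int i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0.0;
    __syncthreads();

    // s is halved each iteration; threads with threadIdx.x >= s do no work,
    // so their stalled warps leave the SM free to schedule warps from other
    // resident blocks.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];        // one partial sum per block
}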

The problem is that my program generates only 30-40% GPU load (and about 50% CPU load), according to GPU-Z 1.9.0. Memory Controller Load is 9-10%, Bus Interface Load is 6%. This is with the number of CPU threads equal to the number of CPU cores. If I double the number of CPU threads, the loads stay about the same (including the CPU load).

So why is that? Where is the bottleneck?

I am using GeForce GTX 560 Ti, CUDA 8RC, MSVC++2013, Windows 10.

One of my guesses is that Windows 10 applies some aggressive power saving, even though the GPU and CPU temperatures are low, the power plan is set to "High performance", and the power supply is 700 W while the power consumption at maximum CPU and GPU TDP is about 550 W.

Another guess is that the double-precision speed is 1/12 of the single-precision speed because there is 1 double-precision CUDA core per 12 single-precision CUDA cores on my card, and GPU-Z counts as 100% only the situation where both the single-precision and the double-precision cores are in use. However, the numbers do not quite match.

Upvotes: 0

Views: 4218

Answers (1)

Serge Rogatch

Reputation: 15090

Apparently the reason was low occupancy: by default the CUDA threads were using too many registers. To tell the compiler the limit on the number of registers per thread, __launch_bounds__ can be used, as described here. So, to be able to keep all 1536 resident threads per SM busy on the 560 Ti with a block size of 256 (1536 / 256 = 6 blocks per SM), the following can be specified:

__global__ void __launch_bounds__(256, 6) MyKernel(...) { ... }

After limiting the number of registers per CUDA thread, the GPU usage rose to 60% for me.
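
To check whether the limit actually lets 6 blocks of 256 threads become resident per SM, the occupancy API (available since CUDA 6.5) can be queried. The sketch below is illustrative, with an empty placeholder kernel; at compile time, nvcc's -Xptxas -v option also prints the register count per kernel.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void __launch_bounds__(256, 6) MyKernel() { /* ... */ }

int main()
{
    int numBlocks = 0;
    // Reports how many blocks of 256 threads can be resident per SM for MyKernel,
    // given its actual register and shared memory usage after __launch_bounds__.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, MyKernel, 256, 0);
    printf("Resident blocks per SM: %d (6 means all 1536 threads fit)\n", numBlocks);
    return 0;
}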

By the way, 5xx-series cards are still supported by Nsight Visual Studio Edition v5.1, which can be downloaded from the archive.

EDIT: the following device flags have further increased GPU usage to 70% in an application that uses multiple CUDA streams from multiple CPU threads (a minimal setup sketch follows the list below):

cudaSetDeviceFlags(cudaDeviceScheduleYield | cudaDeviceMapHost | cudaDeviceLmemResizeToMax);
  • cudaDeviceScheduleYield lets other threads execute while a CPU thread is waiting for a GPU operation, rather than spin-waiting on the GPU for the result.
  • cudaDeviceLmemResizeToMax, as I understood it, makes kernel launches themselves asynchronous and avoids excessive local memory allocations and deallocations.
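
Here is a minimal sketch of how these flags fit into the multi-threaded setup described in the question; the kernel, buffer sizes and error handling are placeholders, not the actual program:

#include <cuda_runtime.h>
#include <thread>

__global__ void MyKernel(double* d, int n) { /* ... */ }

void WorkerThread(int n)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);                 // one stream per CPU thread

    double *hostBuf, *devBuf;
    cudaHostAlloc(&hostBuf, n * sizeof(double), cudaHostAllocDefault); // pinned, so cudaMemcpyAsync can overlap
    cudaMalloc(&devBuf, n * sizeof(double));

    cudaMemcpyAsync(devBuf, hostBuf, n * sizeof(double), cudaMemcpyHostToDevice, stream);
    MyKernel<<<(n + 255) / 256, 256, 0, stream>>>(devBuf, n);
    cudaMemcpyAsync(hostBuf, devBuf, n * sizeof(double), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);             // with cudaDeviceScheduleYield this yields the CPU instead of spinning

    cudaFree(devBuf);
    cudaFreeHost(hostBuf);
    cudaStreamDestroy(stream);
}

int main()
{
    // Must be called before the CUDA context is created, i.e. before any other
    // runtime call that touches the device.
    cudaSetDeviceFlags(cudaDeviceScheduleYield | cudaDeviceMapHost | cudaDeviceLmemResizeToMax);

    std::thread t1(WorkerThread, 1 << 20);
    std::thread t2(WorkerThread, 1 << 20);
    t1.join();
    t2.join();
    return 0;
}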

Upvotes: 4
