Reputation: 5681
I am using nvprof to measure achieved occupancy, and I am finding it reported (min/max/avg) as
Achieved Occupancy 0.344031 0.344031 0.344031
but using the occupancy calculator, I am finding 75%.
The results are:
Active Threads per Multiprocessor 1536
Active Warps per Multiprocessor 48
Active Thread Blocks per Multiprocessor 6
Occupancy of each Multiprocessor 75%
I am using 33 registers, 144 bytes of shared memory, and 256 threads/block on a device of compute capability 3.5.
EDIT:
Also, something I want to clarify. In http://docs.nvidia.com/cuda/profiler-users-guide/#axzz30pb9tBTN it states, for
gld_efficiency
Ratio of requested global memory load throughput to required global memory load throughput expressed as percentage
So, if this is 0%, does it mean that I have no global memory transfers in the kernel?
Upvotes: 0
Views: 1613
Reputation: 72349
You need to understand that the occupancy calculator provides the maximum theoretical occupancy that a given kernel can achieve, based only on the resource requirements of that kernel. It does not (and cannot) say anything about how much of that theoretical occupancy the code is capable of achieving.
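As an aside, the same theoretical number the calculator spreadsheet gives can also be queried programmatically. A minimal sketch, assuming a toolkit new enough (CUDA 6.5+) to expose cudaOccupancyMaxActiveBlocksPerMultiprocessor; mykernel here is just a placeholder for your real kernel, since the API reads the register count from the compiled kernel itself (so this stub won't reproduce your 33-register case):

#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel: occupancy is computed from whatever resources
// *this* compiled kernel actually uses, not from numbers you type in.
__global__ void mykernel(float *out) { if (out) out[threadIdx.x] = 0.f; }

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int maxBlocksPerSM = 0;
    // 256 threads per block and 144 bytes of dynamic shared memory,
    // matching the configuration in the question.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM,
                                                  mykernel, 256, 144);

    int activeWarps = maxBlocksPerSM * (256 / prop.warpSize);
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    printf("Theoretical occupancy: %.0f%%\n",
           (100.0 * activeWarps) / maxWarps);
    return 0;
}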
The profiling tools, on the other hand, deduce actual occupancy from measured profile counters. According to this document, the achieved occupancy number you are asking about is calculated as
(active_warps / active_cycles) / MAX_WARPS_PER_SM
i.e. it samples the number of active warps on one or more SMs during the kernel run and calculates the actual occupancy from that.
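Working that through with the numbers in the question (compute capability 3.5 allows 2048 resident threads per SM, i.e. MAX_WARPS_PER_SM = 64):

48 active warps / 64 max warps = 0.75        (the calculator's 75%)
0.344031 * 64 max warps ≈ 22 active warps    (the average actually measured per active cycle)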
There can be many reasons why a kernel doesn't achieve its theoretical occupancy, and (before you ask) no, I can't tell you why yours doesn't. But the Visual Profiler can. If it is important to you, I suggest you look at the automated performance analysis features available in the CUDA 5/6 Visual Profiler as a way of better understanding the performance of your code.
It is also worth pointing out that occupancy should be treated as only a rough metric of potential code performance, and high theoretical occupancy doesn't always translate into high performance. Instruction level parallelism and latency minimisation strategies can also be very effective at reaching high levels of performance, even at low occupancy. There is a large body of work on this, most stemming from Vasily Volkov's seminal GTC 2010 paper.
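To make the ILP point concrete, here is a minimal sketch of the idea (my own illustration, not code from the paper): each thread keeps four independent partial sums in registers, so arithmetic latency can be hidden by independent instructions within a single warp rather than by keeping many warps resident.

// Grid-stride sum with 4 independent accumulators per thread.
// The four additions have no dependence on each other, so up to
// four FADDs per thread can be in flight at once.
__global__ void sum4(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    for (; i + 3 * stride < n; i += 4 * stride) {
        s0 += in[i];
        s1 += in[i + stride];
        s2 += in[i + 2 * stride];
        s3 += in[i + 3 * stride];
    }
    for (; i < n; i += stride)   // tail elements
        s0 += in[i];

    // One partial result per thread; a second pass would finish the sum.
    out[blockIdx.x * blockDim.x + threadIdx.x] = s0 + s1 + s2 + s3;
}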
Upvotes: 2