Reputation: 35525
I have recently started playing with the NVIDIA Visual Profiler (CUDA 7.5) to time my applications.
However, I don't seem to fully understand the implications of the outputs I get, and I don't know how to act on the different profiler outputs.
As an example: a CUDA code that calls a single kernel ~360 times in a for loop. Each time, the kernel performs 512^2 times about 1000 3D texture memory reads. One thread is allocated per unit of the 512^2. Some arithmetic is needed to determine which position to read in texture memory. Texture memory reads are performed without interpolation, always at the exact data index. 3D texture memory was chosen because the memory reads will be relatively random, so memory coalescing is not expected. I can't find the reference for this, but I definitely read it on SO somewhere.
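For readers who want something concrete, a kernel of this general shape might look like the sketch below. It is not the kernel from the question: the names `sampleKernel` and `makePointSampledTexture`, the use of texture objects, the `float` element type, and the index arithmetic are all assumptions made for illustration only.

```cuda
#include <cuda_runtime.h>
#include <cstring>

// Hypothetical kernel: one thread per element of a width x height grid,
// each performing nSamples point-sampled reads from a 3D texture.
__global__ void sampleKernel(cudaTextureObject_t tex, float* out,
                             int width, int height, int nSamples)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float acc = 0.0f;
    for (int i = 0; i < nSamples; ++i) {
        // Placeholder index arithmetic (assumption); the real kernel
        // computes its own read positions.
        float px = (float)((x + 3 * i) % width);
        float py = (float)((y + 7 * i) % height);
        float pz = (float)i;
        // With point filtering and unnormalized coordinates, adding 0.5f
        // addresses the texel centre, so the stored value is returned as-is.
        acc += tex3D<float>(tex, px + 0.5f, py + 0.5f, pz + 0.5f);
    }
    out[y * width + x] = acc;
}

// Host helper: wrap an existing 3D cudaArray in a texture object configured
// for exact-index reads (no interpolation, unnormalized coordinates).
cudaTextureObject_t makePointSampledTexture(cudaArray_t volume)
{
    cudaResourceDesc res;
    memset(&res, 0, sizeof(res));
    res.resType = cudaResourceTypeArray;
    res.res.array.array = volume;

    cudaTextureDesc td;
    memset(&td, 0, sizeof(td));
    td.filterMode = cudaFilterModePoint;       // no interpolation
    td.readMode = cudaReadModeElementType;     // return the stored float
    td.normalizedCoords = 0;                   // index in texel units
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.addressMode[2] = cudaAddressModeClamp;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, NULL);
    return tex;
}
```

In the setup described above, the host would launch such a kernel about 360 times from a loop, with `width = height = 512` and `nSamples` around 1000.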
The description is short, but I hope it gives a small overview of what operations the kernel does (posting the whole kernel would probably be too much, but I can if required).
From now on, I will describe my interpretation of the profiler.
When profiling, if I run Examine GPU usage, I get:
From here I see several things:
From the kernel execution "bars" at the top and right, I can see:
I follow my profiling by running Perform Kernel Analysis, getting:
I can see here that
Something else?
I continue with Perform Latency Analysis, as the profiler suggests that the biggest bottleneck is there.
The biggest 3 stall reasons seem to be
a=a+1;a=a*a;b=b+1;b=b*b; to a=a+1;b=b+1;a=a*a;b=b*b;?
Questions:
Upvotes: 1
Views: 1330
Reputation: 905
Are there any additional tests I can perform to better understand my kernel's execution time limitations?
Of course! Pay attention to the "Properties" window. Your screenshot is telling you that your kernel (1) is limited by register usage (check it in the 'Kernel Latency' analysis), and (2) has low Warp Efficiency (less than 100% means thread divergence; check it in 'Divergent Execution').
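To illustrate what thread divergence looks like in code, here is a minimal sketch with two hypothetical kernels (`divergentBranch` and `warpUniformBranch` are names made up for this example, not taken from the question). Warp execution efficiency is the ratio of the average number of active threads per warp to the warp size, so a branch that splits the lanes of a warp lowers it, while a branch that is uniform across each warp does not. The sketch assumes 1D blocks whose size is a multiple of the warp size.

```cuda
#include <cuda_runtime.h>

// The branch depends on the parity of the thread index: even and odd lanes
// of the same warp take different paths, so the warp runs both paths in
// turn with half of its lanes masked off each time.
__global__ void divergentBranch(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0)
        out[i] = in[i] * 2.0f;
    else
        out[i] = in[i] + 1.0f;
}

// The branch depends on the warp index: all lanes of a given warp take the
// same path, so no warp has to execute both sides.
// __launch_bounds__(256) tells the compiler the kernel is never launched
// with more than 256 threads per block, which it can use when deciding how
// many registers to allocate per thread.
__global__ void __launch_bounds__(256)
warpUniformBranch(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((threadIdx.x / warpSize) % 2 == 0)
        out[i] = in[i] * 2.0f;
    else
        out[i] = in[i] + 1.0f;
}
```

The two kernels do not compute the same result; they only contrast the branching pattern. For the register limit, compiling with `nvcc -Xptxas -v` prints the register count per kernel, which can be compared against what the profiler reports.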
Is there a way to profile at the instruction level inside the kernel?
Yes, you have two types of profiling available:
Are there more conclusions one can draw from the profiling than the ones I have drawn?
You should check if your kernel has some thread divergence. Also you should check that there is no problem with shared/global memory access patterns.
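As a rough illustration of what a good and a bad global memory access pattern look like, consider the two hypothetical copy kernels below (`coalescedCopy`, `stridedCopy` and the `stride` parameter are invented for this sketch). In the first, consecutive threads of a warp touch consecutive addresses, so the accesses can be combined into few memory transactions. In the second, consecutive threads touch addresses `stride` elements apart, which typically requires many more transactions for the same amount of useful data.

```cuda
#include <cuda_runtime.h>

// Thread i reads and writes element i: neighbouring threads access
// neighbouring addresses (coalesced pattern).
__global__ void coalescedCopy(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Thread i reads and writes element (i * stride) % n: neighbouring threads
// access addresses that are `stride` elements apart (scattered pattern).
// If stride and n share no common factor, every element is still copied
// exactly once. Assumes i * stride fits in an int.
__global__ void stridedCopy(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (i * stride) % n;
        out[j] = in[j];
    }
}
```

Profiling both kernels on the same buffer should show the difference in the memory-related metrics, although the exact numbers depend on the GPU. Note that the kernel in the question reads through texture memory, where the caching behaviour differs from plain global loads.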
If I were to start trying to optimize the kernel, where would I start?
I find the Kernel Latency window the most useful one, but I suppose it depends on the type of kernel you are analyzing.
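The reordering asked about in the question (a=a+1;a=a*a;b=b+1;b=b*b; versus a=a+1;b=b+1;a=a*a;b=b*b;) relates to the stall reasons that latency analysis reports. A minimal sketch of the two orderings follows; the function and kernel names are hypothetical, and `float` is an assumed type. Both orderings compute the same values. The idea behind the second is that independent instructions sit next to each other, so one can be issued while the other's result is still pending.

```cuda
#include <cuda_runtime.h>

// Ordering 1: each multiply directly follows the add it depends on.
__device__ void dependentOrder(float& a, float& b)
{
    a = a + 1.0f;
    a = a * a;      // needs the result of the line above
    b = b + 1.0f;
    b = b * b;      // needs the result of the line above
}

// Ordering 2: the two independent adds come first, then the two multiplies.
__device__ void interleavedOrder(float& a, float& b)
{
    a = a + 1.0f;
    b = b + 1.0f;   // independent of the line above
    a = a * a;
    b = b * b;
}

// Runs both orderings on the same inputs and stores the results so they
// can be compared on the host.
__global__ void compareOrders(float* out)
{
    float a1 = 1.0f, b1 = 2.0f;
    float a2 = 1.0f, b2 = 2.0f;
    dependentOrder(a1, b1);
    interleavedOrder(a2, b2);
    out[0] = a1; out[1] = b1;
    out[2] = a2; out[3] = b2;
}
```

In practice the compiler schedules instructions itself and will often produce the same machine code for both versions, so a source-level reordering like this may change nothing. Inspecting the generated code with `cuobjdump -sass` is one way to check whether the two orderings actually differ after compilation.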
Upvotes: 3