Reputation: 1597
I am getting a lot of profiling overhead when trying to profile my code using nvvp (or with nvprof):
Overall time is 98 ms and I'm getting 85 ms of "Instrumentation" in the first kernel launch.
How can I reduce this profiling overhead or otherwise zoom-in on just the part that I'm interested in?
I am running this with "Start execution with profiling enabled" unchecked, and I've limited profiling using cudaProfilerStart/cudaProfilerStop like so:
#include <cuda_profiler_api.h>  // needed for cudaProfilerStart()/cudaProfilerStop()

/* --- generate data etc --- */

// Call the function once to warm up the FFT plan cache
applyConvolution( T, N, stride, plans, yData, phiW, fData, y_dwt );
gpuErrchk( cudaeviceSynchronize() );

// Call it once for profiling
cudaProfilerStart();
applyConvolution( T, N, stride, plans, yData, phiW, fData, y_dwt );
gpuErrchk( cudaDeviceSynchronize() );
cudaProfilerStop();
where applyConvolution() is the function that I'm profiling.
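(Here gpuErrchk is just the usual CUDA error-checking wrapper; if it's not familiar, it's something along these lines, and the exact definition isn't important to the question:)

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with file/line information on any CUDA error.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert( cudaError_t code, const char *file, int line, bool abort = true )
{
    if ( code != cudaSuccess )
    {
        fprintf( stderr, "GPUassert: %s %s %d\n", cudaGetErrorString( code ), file, line );
        if ( abort ) exit( code );
    }
}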
I am using CUDA Toolkit 8.0 on Ubuntu 16.04 with a GTX 1080.
Upvotes: 4
Views: 907
Reputation: 21
I am seeing something similar, though perhaps only vaguely related. Since the answer above helped, I'll add my observations.
When profiling on a Quadro GV100 there is a massive change in apparent performance for fairly simple kernels compared to Pascal-generation cards (e.g. a 1080). I too am running nvvp with profiling initially disabled and activating it only in the part of the code I'm interested in. At one point I accidentally forgot to turn it on, and all I got was our manual event markers (using nvtxRangePush & nvtxRangePop). What do you know: a tenfold speedup. That is to say, on the Quadro GV100 there is a massive profiling overhead that is not there on earlier-generation GPUs.
Disabling concurrent profiling as you did does NOT help, but disabling the API tracing DOES.
There is still a significant overhead compared to the manual NVTX markers, but at least it allows some idea of kernel performance on the GV100. Larger kernels seem less affected, which is natural if it's related to fixed-cost overhead or to API tracing. The remaining unknown is why API tracing costs so much on the GV100 specifically, but I'm in no position to speculate, at least not yet.
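For reference, the manual markers I mean are plain NVTX ranges around the region of interest. A minimal sketch (the kernel and range names are made up for illustration; link with -lnvToolsExt):

#include <cuda_runtime.h>
#include <nvToolsExt.h>                          // NVTX markers; link with -lnvToolsExt

__global__ void dummyKernel( float *x, int n )   // stand-in for the real kernels
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < n ) x[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc( &d_x, n * sizeof(float) );

    nvtxRangePushA( "region of interest" );      // named range shows up on the timeline
    dummyKernel<<<(n + 255) / 256, 256>>>( d_x, n );
    cudaDeviceSynchronize();
    nvtxRangePop();                              // close the range

    cudaFree( d_x );
    return 0;
}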
I compiled sm-specific binaries using gcc/5.4 and cuda/9.0 for the above tests, and ran RELION single-threaded for a simple test-case.
Upvotes: 1
Reputation: 1597
As I was writing up this question, I thought I'd try messing around with the profiler settings to try and preempt some potential answer-in-comment material.
To my surprise, disabling "Enable concurrent kernel profiling" got rid of the profiler overhead completely:
But perhaps this shouldn't have been that much of a surprise:
Enable concurrent kernel profiling - This option should be selected for an application that uses CUDA streams to launch kernels that can execute concurrently. If the application uses only a single stream (and therefore cannot have concurrent kernel execution), deselecting this option may decrease profiling overhead.
(taken from http://docs.nvidia.com/cuda/profiler-users-guide/)
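For reference, the multi-stream launch pattern that option is aimed at looks roughly like the following sketch (purely illustrative, not taken from my code):

#include <cuda_runtime.h>

__global__ void smallKernel( float *x, int n )   // short, illustrative kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < n ) x[i] += 1.0f;
}

int main()
{
    const int n = 1 << 16;
    float *d_a, *d_b;
    cudaMalloc( &d_a, n * sizeof(float) );
    cudaMalloc( &d_b, n * sizeof(float) );

    cudaStream_t s[2];
    for ( int i = 0; i < 2; ++i ) cudaStreamCreate( &s[i] );

    // Kernels launched into different streams may overlap on the device;
    // this is the situation "Enable concurrent kernel profiling" is meant for.
    smallKernel<<<(n + 255) / 256, 256, 0, s[0]>>>( d_a, n );
    smallKernel<<<(n + 255) / 256, 256, 0, s[1]>>>( d_b, n );

    for ( int i = 0; i < 2; ++i ) cudaStreamSynchronize( s[i] );
    for ( int i = 0; i < 2; ++i ) cudaStreamDestroy( s[i] );
    cudaFree( d_a );
    cudaFree( d_b );
    return 0;
}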
An earlier version of the CUDA Profiler User's Guide also noted in a "Profiling Limitations" section that:
Concurrent kernel mode can add significant overhead if used on kernels that execute a large number of blocks and that have short execution durations.
Oh well. Posting this question/answer anyway in case it helps someone else avoid this annoyance.
Upvotes: 5