Reputation: 15744
I have a CUDA kernel that calls out to a series of device functions.
What is the best way to get the execution time for each of the device functions?
What is the best way to get the execution time for a section of code in one of the device functions?
Upvotes: 10
Views: 2753
Reputation: 1384
In my own code, I use the clock() function to get precise timings. For convenience, I have the macros
/* IDs used to index the timer array. */
enum {
    tid_this = 0,
    tid_that,
    tid_count
};

/* Accumulated clock ticks, one entry per timer ID. */
__device__ float cuda_timers[ tid_count ];

#ifdef USETIMERS
    /* Only thread 0 of each block records a start time and adds its elapsed ticks. */
    #define TIMER_TIC clock_t tic; if ( threadIdx.x == 0 ) tic = clock();
    #define TIMER_TOC(tid) clock_t toc = clock(); if ( threadIdx.x == 0 ) atomicAdd( &cuda_timers[tid] , ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) );
#else
    #define TIMER_TIC
    #define TIMER_TOC(tid)
#endif
These can then be used to instrument the device code as follows:
__global__ void mykernel ( ... ) {
    /* Start the timer. */
    TIMER_TIC
    /* Do stuff. */
    ...
    /* Stop the timer and add the elapsed ticks to the "tid_this" counter. */
    TIMER_TOC( tid_this );
}
You can then read the cuda_timers array in the host code.
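As a rough sketch of the host side, assuming the timer array is copied back with cudaMemcpyFromSymbol (one way to read a __device__ variable, not necessarily how my own code does it): the kernel body, the launch configuration and the helper name run_and_read_timers below are placeholders made up for illustration, and the timing is written inline rather than through the macros to keep the example self-contained.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

enum { tid_this = 0, tid_that, tid_count };

// Accumulated clock ticks, one entry per timer ID.
__device__ float cuda_timers[tid_count];

// Placeholder kernel: times some dummy work and lets thread 0 of each
// block add its elapsed ticks to the tid_this counter.
__global__ void mykernel(float *out) {
    unsigned int tic = (unsigned int)clock();
    float x = (float)threadIdx.x;
    for (int i = 0; i < 1000; ++i) x = x * 1.0001f + 1.0f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
    unsigned int toc = (unsigned int)clock();
    if (threadIdx.x == 0) atomicAdd(&cuda_timers[tid_this], (float)(toc - tic));
}

// Hypothetical helper: zero the timers, run the kernel, copy the timers back.
// Returns 0 on success, 1 if any CUDA call failed.
int run_and_read_timers(float *host_timers) {
    const int blocks = 4, threads = 64;  // arbitrary launch configuration
    float zeros[tid_count] = {0.0f};
    float *d_out = nullptr;

    if (cudaMemcpyToSymbol(cuda_timers, zeros, sizeof(zeros)) != cudaSuccess) return 1;
    if (cudaMalloc(&d_out, blocks * threads * sizeof(float)) != cudaSuccess) return 1;

    mykernel<<<blocks, threads>>>(d_out);
    cudaError_t err = cudaDeviceSynchronize();

    // Copy the whole timer array from the device symbol into host memory.
    if (err == cudaSuccess)
        err = cudaMemcpyFromSymbol(host_timers, cuda_timers, tid_count * sizeof(float));

    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}

int main() {
    float host_timers[tid_count];
    if (run_and_read_timers(host_timers) != 0) {
        fprintf(stderr, "CUDA call failed\n");
        return 1;
    }
    for (int i = 0; i < tid_count; ++i)
        printf("timer %d: %.0f ticks (summed over all blocks)\n", i, host_timers[i]);
    return 0;
}
```

The values are clock ticks, not seconds, and they are summed over every block that executed the timed section, so they are mostly useful for comparing sections against each other.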
A few notes:
- The macros are wrapped in #ifdef USETIMERS so you can switch them off easily.
- Although clock() returns integer values of type clock_t, I store the accumulated values as float, otherwise the values will wrap around for kernels that take longer than a few seconds (accumulated over all blocks).
- The selection ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) is necessary in case the clock counter wraps around.

P.S. This is a copy of my reply to this question, which didn't get many points there since the timing required was for the whole kernel.
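On the wrap-around note: a possible alternative to the conditional, sketched here under the assumption that the counter is 32 bits wide (as the 0xffffffff in the macro implies), is to keep both readings as unsigned int and simply subtract. Unsigned arithmetic is modulo 2^32, so the difference stays correct across a single wrap. The helper name elapsed_ticks is hypothetical and is not part of the macros above.

```cuda
#include <cstdio>

// Hypothetical helper: elapsed ticks between two readings of a 32-bit
// counter. Unsigned subtraction wraps modulo 2^32, so this is valid
// even if the counter overflowed once between tic and toc.
__host__ __device__ inline unsigned int elapsed_ticks(unsigned int tic, unsigned int toc) {
    return toc - tic;
}

int main() {
    // No wrap-around: plain difference.
    printf("%u\n", elapsed_ticks(100u, 350u));          // prints 250
    // Counter wrapped between the two readings.
    printf("%u\n", elapsed_ticks(0xFFFFFFF0u, 0x10u));  // prints 32
    return 0;
}
```

In device code the readings would be taken as (unsigned int)clock(). Note that for the wrapped case this gives one tick more than the conditional expression in the macro, since the wrap itself spans 2^32 counter values rather than 0xffffffff.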
Upvotes: 7