Amin

Reputation: 381

How to measure the inner kernel time in NVIDIA CUDA?

I want to measure elapsed time inside a GPU kernel. How can I do this in NVIDIA CUDA? For example:

__global__ void kernelSample()
{
  // some code here
  // get start time
  // some code here
  // get stop time
  // some code here
}

Upvotes: 25

Views: 19723

Answers (3)

paleonix

Reputation: 3031

Nowadays NVIDIA provides a (partial) implementation of std::chrono in its libcu++ standard library. So if you want to measure time instead of cycles, you can call cuda::std::chrono::system_clock::now() from the <cuda/std/chrono> header to query a timestamp.

Note its documentation:

To implement std::chrono::system_clock, we use:

PTX’s %globaltimer is a system clock which also happens to be monotonically increasing on today’s NVIDIA GPUs (e.g. it cannot be updated and is not changed when the host system clock changes). However, this is not necessarily the case with respect to host threads, where updates of the system clock may occur during the execution of the program.

PTX’s %globaltimer is initialized from the host system clock upon device attach; that may be at program start, but it could be earlier (for example, due to CUDA persistence mode). Since PTX’s %globaltimer is a system clock, it counts real-world time, and thus it has the same tick rate as the host system clock.

There is potential for logical inconsistencies between the time that host threads and device threads observe from our std::chrono::system_clock. However, this is perfectly fine; it is an inherent property of system clocks. [...]

This is the reason there is no cuda::std::chrono::steady_clock, which is what one would use for benchmarking/timing purposes on the host. But it also means that for measuring time within a single thread, cuda::std::chrono::system_clock is fine (assuming the resolution is good enough for what you are trying to measure).
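A minimal sketch of how this could look in practice. The kernel name chronoTimedKernel, the helper measureChronoNs, the dummy loop, and the sink buffer are all made up for illustration; only cuda::std::chrono::system_clock::now() and duration_cast come from libcu++. The compiler may still move the arithmetic relative to the two timestamp reads, so treat the numbers as rough.

```cuda
#include <cuda/std/chrono>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread times a dummy loop and stores the
// elapsed nanoseconds. Assumes a 1D launch.
__global__ void chronoTimedKernel(long long *elapsed_ns, float *sink, int iters)
{
    namespace chr = cuda::std::chrono;
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    chr::system_clock::time_point start = chr::system_clock::now();

    // Placeholder workload standing in for "some code here"
    float acc = static_cast<float>(tid);
    for (int i = 0; i < iters; ++i)
        acc = acc * 1.0001f + 0.5f;

    chr::system_clock::time_point stop = chr::system_clock::now();

    sink[tid] = acc;  // keep the result live so the loop is not optimized away
    elapsed_ns[tid] = chr::duration_cast<chr::nanoseconds>(stop - start).count();
}

// Hypothetical host helper: launches one warp and returns thread 0's
// measurement in nanoseconds, or -1 if a CUDA call fails.
long long measureChronoNs(int iters)
{
    const int threads = 32;
    long long *d_elapsed = nullptr;
    float *d_sink = nullptr;
    long long h_elapsed[threads];

    if (cudaMalloc(&d_elapsed, sizeof(h_elapsed)) != cudaSuccess)
        return -1;
    if (cudaMalloc(&d_sink, threads * sizeof(float)) != cudaSuccess) {
        cudaFree(d_elapsed);
        return -1;
    }

    chronoTimedKernel<<<1, threads>>>(d_elapsed, d_sink, iters);
    // cudaMemcpy waits for the kernel and also reports launch/execution errors
    cudaError_t err = cudaMemcpy(h_elapsed, d_elapsed, sizeof(h_elapsed),
                                 cudaMemcpyDeviceToHost);

    cudaFree(d_elapsed);
    cudaFree(d_sink);
    return err == cudaSuccess ? h_elapsed[0] : -1;
}
```

Calling measureChronoNs(100000) from main() would give one thread's view of the loop duration. A result of 0 is possible if the timer's update granularity is coarser than the work being measured.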

Upvotes: 1

talonmies

Reputation: 72348

You can do something like this:

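__global__ void kernelSample(int *runtime)
{
  // Global thread index (assumes a 1D launch); one measurement per thread
  int tidx = blockIdx.x * blockDim.x + threadIdx.x;

  // ....
  clock_t start_time = clock();
  // some code here
  clock_t stop_time = clock();
  // ....

  // Elapsed clock cycles between the two clock() calls
  runtime[tidx] = (int)(stop_time - start_time);
}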

This gives the number of clock cycles between the two calls. Be a little careful, though: the timer will overflow after a couple of seconds, so you should be sure that the duration of the code between successive calls is quite short. You should also be aware that the compiler and assembler do perform instruction reordering, so you might want to check that the clock calls don't wind up next to each other in the SASS output (use cuobjdump to check).
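If the overflow is a concern, a variant with the 64-bit clock64() counter might look like the sketch below. The names cycleTimedKernel, cyclesToMs, and measureCyclesMs, as well as the dummy loop, are hypothetical. The conversion to milliseconds uses the peak clock rate reported by cudaDevAttrClockRate (in kHz), so it is only an approximation when the GPU is not actually running at that frequency.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: every thread counts the SM clock cycles spent in a
// dummy loop using the 64-bit counter. Assumes a 1D launch.
__global__ void cycleTimedKernel(long long *cycles, float *sink, int iters)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    long long start = clock64();

    // Placeholder workload standing in for "some code here"
    float acc = static_cast<float>(tid);
    for (int i = 0; i < iters; ++i)
        acc = acc * 1.0001f + 0.5f;

    long long stop = clock64();

    sink[tid] = acc;  // keep the result live so the loop is not optimized away
    cycles[tid] = stop - start;
}

// Cycles divided by kHz gives milliseconds (1 kHz = 1000 cycles per second).
double cyclesToMs(long long cycles, int clockRateKHz)
{
    return static_cast<double>(cycles) / clockRateKHz;
}

// Hypothetical host helper: returns thread 0's approximate time in ms on
// device 0, or -1.0 if a CUDA call fails.
double measureCyclesMs(int iters)
{
    const int threads = 32;
    int clockRateKHz = 0;
    long long *d_cycles = nullptr;
    float *d_sink = nullptr;
    long long h_cycles[threads];

    if (cudaDeviceGetAttribute(&clockRateKHz, cudaDevAttrClockRate, 0) != cudaSuccess)
        return -1.0;
    if (cudaMalloc(&d_cycles, sizeof(h_cycles)) != cudaSuccess)
        return -1.0;
    if (cudaMalloc(&d_sink, threads * sizeof(float)) != cudaSuccess) {
        cudaFree(d_cycles);
        return -1.0;
    }

    cycleTimedKernel<<<1, threads>>>(d_cycles, d_sink, iters);
    // cudaMemcpy waits for the kernel and also reports launch/execution errors
    cudaError_t err = cudaMemcpy(h_cycles, d_cycles, sizeof(h_cycles),
                                 cudaMemcpyDeviceToHost);

    cudaFree(d_cycles);
    cudaFree(d_sink);
    return err == cudaSuccess ? cyclesToMs(h_cycles[0], clockRateKHz) : -1.0;
}
```

The reordering caveat applies here as well: it is still worth inspecting the SASS to confirm the loop really sits between the two counter reads.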

Upvotes: 52

edocetirwi

Reputation: 548

Try this. It measures the time between two CUDA events in milliseconds, with the events recorded from host code around the kernel launch.

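cudaEvent_t start, stop;
float elapsedTime;

// Create both events up front so event creation is not inside the timed region
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);

// Do kernel activity here

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);  // wait until the stop event has completed

cudaEventElapsedTime(&elapsedTime, start, stop);
printf("Elapsed time : %f ms\n", elapsedTime);

// Release the events once the measurement is done
cudaEventDestroy(start);
cudaEventDestroy(stop);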

Upvotes: -1
