Reputation: 381
I want to measure elapsed time inside a GPU kernel. How can I do this in NVIDIA CUDA? For example:
__global__ void kernelSample()
{
    // some code here
    // get start time
    // some code here
    // get stop time
    // some code here
}
Upvotes: 25
Views: 19723
Reputation: 3031
Nowadays Nvidia provides a (partial) implementation of std::chrono in their libcu++ standard library. So if one wants to measure time instead of cycles, one can use cuda::std::chrono::system_clock::now() from the <cuda/std/chrono> header to query a timestamp.
Note its documentation:
To implement std::chrono::system_clock, we use:
- [...]
- PTX’s %globaltimer for device code.
PTX’s %globaltimer is a system clock which also happens to be monotonically increasing on today’s NVIDIA GPUs (e.g. it cannot be updated and is not changed when the host system clock changes). However, this is not necessarily the case with respect to host threads, where updates of the system clock may occur during the execution of the program.
PTX’s %globaltimer is initialized from the host system clock upon device attach; that may be at program start, but it could be earlier (for example, due to CUDA persistence mode). Since PTX’s %globaltimer is a system clock, it counts real-world time, and thus it has the same tick rate as the host system clock.
There is potential for logical inconsistencies between the time that host threads and device threads observe from our std::chrono::system_clock. However, this is perfectly fine; it is an inherent property of system clocks. [...]
This is the reason there is no cuda::std::chrono::steady_clock, which one would use for benchmarking or timing on the host. However, for measuring time within a single thread, cuda::std::chrono::system_clock is fine (assuming its resolution is good enough for what one is trying to measure).
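As a rough sketch of the approach described here (my own illustration, not taken from the libcu++ documentation), a kernel can take two timestamps and store the difference in nanoseconds. The names timedKernel, measureFirstThreadNs, elapsedNs and sink are made up for this example, the loop is only a stand-in for the code being timed, and it assumes a CUDA toolkit recent enough to ship libcu++.

```cuda
#include <cstdio>
#include <cuda/std/chrono>

// Hypothetical kernel: each thread times a dummy loop and stores nanoseconds.
__global__ void timedKernel(long long *elapsedNs, float *sink)
{
    namespace chrono = cuda::std::chrono;
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    auto start = chrono::system_clock::now();

    // Stand-in for the code being measured.
    float acc = 0.0f;
    for (int i = 0; i < 1000; ++i)
        acc = acc * 1.0001f + static_cast<float>(tid + i);
    sink[tid] = acc; // store the result so the loop is not optimized away

    auto stop = chrono::system_clock::now();

    elapsedNs[tid] =
        chrono::duration_cast<chrono::nanoseconds>(stop - start).count();
}

// Hypothetical host helper: launches one block and returns thread 0's reading.
// Error checking is omitted for brevity.
long long measureFirstThreadNs()
{
    const int n = 64;
    long long *dNs = nullptr;
    float *dSink = nullptr;
    cudaMalloc(&dNs, n * sizeof(long long));
    cudaMalloc(&dSink, n * sizeof(float));

    timedKernel<<<1, n>>>(dNs, dSink);

    long long hNs = -1; // stays -1 if the copy fails
    cudaMemcpy(&hNs, dNs, sizeof(long long), cudaMemcpyDeviceToHost);

    cudaFree(dNs);
    cudaFree(dSink);
    return hNs;
}

int main()
{
    printf("thread 0 measured %lld ns\n", measureFirstThreadNs());
    return 0;
}
```

The reading is per thread, so different threads will generally report different values. The timer's resolution may be coarse compared with a short code region, so very small regions might show up as zero.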
Upvotes: 1
Reputation: 72348
You can do something like this:
__global__ void kernelSample(int *runtime)
{
    // Global thread index, assuming a 1D launch
    int tidx = threadIdx.x + blockIdx.x * blockDim.x;
    // ....
    clock_t start_time = clock();
    // some code here
    clock_t stop_time = clock();
    // ....
    runtime[tidx] = (int)(stop_time - start_time);
}
This gives the number of clock cycles between the two calls. Be a little careful, though: the timer will overflow after a couple of seconds, so you should be sure that the duration of the code between successive calls is quite short. You should also be aware that the compiler and assembler do perform instruction re-ordering, so you might want to check that the clock calls don't wind up next to each other in the SASS output (use cuobjdump to check).
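If the overflow is a concern, one option is clock64(), which returns a 64-bit cycle count. The sketch below is only an illustration: cycleKernel, cyclesToMs, cycles and sink are hypothetical names, the loop stands in for the code being timed, and the conversion to milliseconds assumes the GPU runs at the peak clock rate reported by cudaDevAttrClockRate (in kHz), which it may not, so treat the result as an estimate.

```cuda
#include <cstdio>

// Hypothetical kernel: each thread stores the cycle count of a dummy loop.
__global__ void cycleKernel(long long *cycles, float *sink)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    long long start = clock64();

    // Stand-in for the code being measured.
    float acc = 0.0f;
    for (int i = 0; i < 1000; ++i)
        acc = acc * 1.0001f + static_cast<float>(tid + i);
    sink[tid] = acc; // store the result so the loop is not optimized away

    long long stop = clock64();

    cycles[tid] = stop - start;
}

// A clock rate in kHz is the number of cycles per millisecond,
// so dividing cycles by it yields milliseconds.
double cyclesToMs(long long cycles, int clockRateKHz)
{
    return static_cast<double>(cycles) / clockRateKHz;
}

int main()
{
    const int n = 64;
    long long *dCycles = nullptr;
    float *dSink = nullptr;
    cudaMalloc(&dCycles, n * sizeof(long long));
    cudaMalloc(&dSink, n * sizeof(float));

    cycleKernel<<<1, n>>>(dCycles, dSink);

    long long hCycles = 0;
    cudaMemcpy(&hCycles, dCycles, sizeof(long long), cudaMemcpyDeviceToHost);

    // Peak clock of device 0 in kHz; the actual clock may be lower.
    int clockRateKHz = 1;
    cudaDeviceGetAttribute(&clockRateKHz, cudaDevAttrClockRate, 0);

    printf("thread 0: %lld cycles, roughly %f ms\n",
           hCycles, cyclesToMs(hCycles, clockRateKHz));

    cudaFree(dCycles);
    cudaFree(dSink);
    return 0;
}
```

The same caveat about instruction re-ordering applies here, so it is still worth inspecting the SASS.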
Upvotes: 52
Reputation: 548
Try this; it measures the time between two events, recorded from the host, in milliseconds.
cudaEvent_t start, stop;
float elapsedTime;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
// Do kernel activity here
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop); // wait until the stop event has been reached
cudaEventElapsedTime(&elapsedTime, start, stop);
printf("Elapsed time : %f ms\n", elapsedTime);
cudaEventDestroy(start);
cudaEventDestroy(stop);
Upvotes: -1