Reputation: 5535
I am programming my first real application in CUDA, and I have reached the point where I need to know how long my kernel executions take. However, as the title says, I do not understand why, in applications that launch a kernel more than once, the second launch of the kernel takes so much less time than the first.
For example, in the code below:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <chrono>
#include <iostream>
#include <stdio.h>
void runCuda(unsigned int size);
__global__ void addKernel(const int arraySize)
{
1 + 1;
}
void doStuff(int arraySize)
{
auto t1 = std::chrono::high_resolution_clock::now();
addKernel <<<(arraySize + 31) / 32, 32 >>> (arraySize);
cudaDeviceSynchronize();
auto t2 = std::chrono::high_resolution_clock::now();
std::cout << "Duration: " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << '\n';
cudaDeviceReset();
}
int main()
{
doStuff(1e6);
doStuff(1e6);
return 0;
}
The kernel only does a trivial addition, and it is launched with roughly one million threads (one per array element). The output of the program above is usually something like:
Duration: 1072
Duration: 97
The two numbers vary, but stay consistently around 1000 and 100. The fact that the same kernel runs so much faster the second time makes no sense to me.
Upvotes: 3
Views: 2172
Reputation: 1
A better way to time kernels can be found in the "CUDA C++ Best Practices Guide": use CUDA events, like the following code:
cudaEvent_t start, stop;
float time;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
kernel<<<grid, threads>>>(d_odata, d_idata, size_x, size_y, NUM_REPS);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

cudaEventElapsedTime(&time, start, stop);  // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
Upvotes: 0
Reputation: 1
You will find that on your first run almost all of the extra time is spent in your first cudaMalloc(). This is an initialization step in which the runtime determines the device, swap, and memory conditions, and it can only partly be mitigated.
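If you want to keep that cost out of your measurements, one option is to pay it up front. A minimal sketch, assuming a hypothetical initCuda() helper (the name and the dummy allocation are mine, not from the question):

#include "cuda_runtime.h"

// Pay the one-time CUDA startup cost (context creation, etc.) before any timed code.
void initCuda()
{
    int* dummy = nullptr;
    cudaMalloc(&dummy, sizeof(int));  // the first CUDA call triggers initialization
    cudaFree(dummy);
}

Calling initCuda() at the top of main(), before the first doStuff(1e6), should move most of that initialization time out of the first timed run.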
Upvotes: 0
Reputation: 108
There is overhead when a program launches its first CUDA kernel. You should launch a blank warm-up kernel first when you measure the running time of your kernels, as in the sketch below.
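A minimal sketch of that idea applied to the code in the question (the warmUp kernel and its launch configuration are my own additions):

__global__ void warmUp() {}  // empty kernel: its only job is to absorb the first-launch overhead

int main()
{
    warmUp<<<1, 1>>>();        // untimed warm-up launch
    cudaDeviceSynchronize();   // wait until initialization has finished

    doStuff(1e6);              // both timed runs should now be comparable
    doStuff(1e6);
    return 0;
}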
Upvotes: 4
Reputation: 4713
I haven't worked in this setup, but most likely in the first run the kernel needs to be compiled. Shaders for the GPU must be compiled at runtime, since each device compiles them a bit differently. Otherwise, you'd have to ship as many executables as there are devices, plus a different variation for each OS and whatever else contributes to code compilation (driver version).
Upvotes: -1
Reputation: 76529
Probably because your GPU/CPU is ramping up its clock speed because it has work to do. OS scheduling might also interfere, but that is not the main thing you are experiencing here.
Timing code execution like this usually means at least averaging over multiple runs and, if you want to do better, excluding outliers.
I'm sure that if you add a few more doStuff(1e6); lines, their timings will lie closer to the second result than to the first.
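For instance, a minimal sketch of that approach, reusing the question's includes and addKernel (timeStuff is a hypothetical variant of the question's doStuff that returns the duration instead of printing it, and the run count is my own choice):

long long timeStuff(int arraySize)
{
    auto t1 = std::chrono::high_resolution_clock::now();
    addKernel<<<(arraySize + 31) / 32, 32>>>(arraySize);
    cudaDeviceSynchronize();
    auto t2 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
}

int main()
{
    timeStuff(1e6);                      // warm-up run, result discarded

    const int runs = 10;
    long long total = 0;
    for (int i = 0; i < runs; ++i)
        total += timeStuff(1e6);         // average only the steady-state runs

    std::cout << "Average duration: " << total / runs << '\n';
    return 0;
}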
Upvotes: 0