Reputation: 688
I know this question has been asked several times, but in my application it is critical to get the timing right, so I want to try again:
I calculate the time for a kernel method like this, first as CPU clock time with clock_t:
clock_t start = clock(); // Or std::chrono::system_clock::now() for WALL CLOCK TIME
openCLFunction();
clock_t end = clock();   // Or std::chrono::system_clock::now() for WALL CLOCK TIME
double time_elapsed = 1000.0 * (end - start) / CLOCKS_PER_SEC; // milliseconds of CPU time
And my openCLFunction():
{
// ... enqueue the OpenCL kernel here ...
clFlush(queue);  // submit all queued commands to the device
clFinish(queue); // block until every queued command has completed
}
There is a big difference in the results between the two methods, and to be honest I don't know which one is right, because both are in milliseconds. Can I trust the CPU clock time here? Is there a definitive way to measure this, one whose results I don't have to second-guess? (Note that I call two functions to finish my kernel function.)
Upvotes: 1
Views: 1115
Reputation: 132148
There are (at least) 3 ways to time OpenCL/CUDA execution:

1. Host-side timers (CPU clock or wall-clock time taken around the host API calls).
2. Events, using the device-side profiling information they carry.
3. A profiling tool.
Your first example falls in the first category, but you don't seem to be flushing the queue which the OpenCL function uses (I'm assuming that's a function enqueueing a kernel). So, unless the execution is somehow forced to be synchronous, what you would be measuring is the time it takes to enqueue the kernel plus whatever CPU-side work you do before or after that. That could explain the discrepancy with the clFlush/clFinish method.
Another reason for the discrepancy could be setup/tear-down work (e.g. memory allocation or run-time internal overhead) which your second method times and your first does not.
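For illustration, here is a minimal sketch of host-side wall-clock timing that forces the queue to drain before stopping the clock; the queue, kernel, and work size here are placeholders, not taken from the question:

#include <CL/cl.h>
#include <chrono>
#include <iostream>

// Assumes `queue` and `kernel` have already been created and set up elsewhere.
void timeKernelHostSide(cl_command_queue queue, cl_kernel kernel)
{
    size_t global_size = 1024; // placeholder work size

    auto start = std::chrono::steady_clock::now(); // monotonic wall clock

    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr,
                           &global_size, nullptr, 0, nullptr, nullptr);
    clFinish(queue); // block until the kernel has actually finished on the device

    auto end = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::milli> elapsed = end - start;
    std::cout << "Enqueue + execution took " << elapsed.count() << " ms\n";
}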
A final note: all three methods will produce slightly different results, due both to measurement inaccuracy and to differences in the overhead required to make use of them. These differences may not be so slight if your kernels are small, though: in my experience, profiler-provided kernel execution times and event-measured times, in CUDA and on nVIDIA Maxwell and Pascal cards, can differ by dozens of microseconds. The lesson of that fact: when kernels are that short, take any single figure with a grain of salt, and only compare numbers obtained with the same method.
Upvotes: 3
Reputation: 20396
You should probably be using kernel profiling via OpenCL events:
#include <CL/cl.h>
#include <chrono>

cl_int err;
cl_command_queue_properties properties[] {CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0};
cl_command_queue queue = clCreateCommandQueueWithProperties(context, device, properties, &err);
/*Later...*/
cl_event event;
clEnqueueNDRangeKernel(queue, kernel, /*...*/, &event);
clWaitForEvents(1, &event); // block until the kernel has finished
cl_ulong start, end; // device timestamps, in nanoseconds
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, nullptr);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, nullptr);
std::chrono::nanoseconds duration{end - start};
At the end of that code, duration contains the number of nanoseconds (reported as precisely as the device is capable of; note that many devices don't offer sub-microsecond precision) that passed between the beginning and the end of the kernel's execution.
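If you want to compare this device-measured number with your host-side timings, you can convert the duration to milliseconds; a small sketch, assuming the duration variable from the code above and that <chrono> and <iostream> are included:

double ms = std::chrono::duration_cast<std::chrono::duration<double, std::milli>>(duration).count();
std::cout << "Kernel execution (device-measured): " << ms << " ms\n";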
Upvotes: 2