user3532017

Reputation: 11

OpenCL measure kernels time

I have two kernels that are launched inside the loop shown below. I want to measure the total execution time of all kernel launches, even though the first kernel may run, for example, 10 times and the second only 5 times.

Thank you very much.

for (int arrayLength = minLengthArray; arrayLength <= N; arrayLength *= 2)
{
    int threadCount = 0;
    int batchSize = N / arrayLength;
    int  GroupCount = (batchSize * arrayLength) / SHARED_SIZE_LIMIT;
    size_t local = SHARED_SIZE_LIMIT / 2;
    size_t global = local * GroupCount;

    if (arrayLength <= SHARED_SIZE_LIMIT)
    {
        err = clEnqueueNDRangeKernel(commands, oddEvenMergeSortSharedKernel, 1, NULL, &global, &local, 0, NULL,  NULL);
    }
    else
    {
        clSetKernelArg(oddEvenMergeSortSharedKernel, 5, sizeof(unsigned int), &SHARED_SIZE_LIMIT);
        err = clEnqueueNDRangeKernel(commands, oddEvenMergeSortSharedKernel, 1, NULL, &global, &local, 0, NULL,  NULL);

        for (int size = 2 * SHARED_SIZE_LIMIT; size <= arrayLength; size <<= 1)
            for (unsigned stride = size / 2; stride > 0; stride >>= 1)
            {
                global = batchSize * arrayLength;
                local = 256;

                err = clEnqueueNDRangeKernel(commands, oddEvenMergeSortGlobalKernel, 1, NULL, &global, &local, 0, NULL,  NULL);
            }
    }
}

Upvotes: 1

Views: 2129

Answers (1)

Marco13

Reputation: 54639

It partially depends on how you want to "schedule" the timing.

Some general hints:

You'll have to enable event profiling for your command queue, by passing the CL_QUEUE_PROFILING_ENABLE flag when you create your command queue:

commands = clCreateCommandQueue(
    context, device, CL_QUEUE_PROFILING_ENABLE, &err);

In your loop, you have to create events for your kernel calls. Here you have different options: you could either collect the events for the kernel calls (in some vector, list or array) and process them later, or perform the timing of each event individually. The basic approach would be as follows, sketched for the first kernel (error handling omitted):

cl_event event0;  // creating an event variable for timing 
clEnqueueNDRangeKernel(commands, oddEvenMergeSortSharedKernel, 
    1, NULL, &global, &local, 0, NULL, &event0); // Pass in event here

clWaitForEvents (1, &event0); // Wait for the event

// Obtain the start- and end time for the event
// cl_ulong is always 64 bits; unsigned long is only 32 bits on some platforms
cl_ulong start = 0;
cl_ulong end = 0;
clGetEventProfilingInfo(event0,CL_PROFILING_COMMAND_START,
    sizeof(cl_ulong),&start,NULL);       
clGetEventProfilingInfo(event0,CL_PROFILING_COMMAND_END,
    sizeof(cl_ulong),&end,NULL);

// Compute the duration in nanoseconds
cl_ulong duration = end - start;

// Don't forget to release the event
clReleaseEvent(event0);

The duration (in nanoseconds) may then be accumulated accordingly. The same scheme could be used for the second kernel.
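The accumulation step could be sketched roughly as follows. This is only an illustration, not code from the question: the names KernelTimer, kernel_timer_add, kernel_timer_ms and kernel_timer_report are hypothetical, and uint64_t stands in for cl_ulong (both are 64-bit unsigned) so that the sketch compiles without the OpenCL headers.

```c
#include <stdint.h>
#include <stdio.h>

/* Running total for one kernel (hypothetical helper type).
   uint64_t is used in place of cl_ulong; both are 64-bit unsigned. */
typedef struct {
    uint64_t total_ns;   /* sum of (end - start) over all launches */
    unsigned launches;   /* number of launches added so far */
} KernelTimer;

/* Add one launch, given the values read for
   CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END. */
static void kernel_timer_add(KernelTimer *t, uint64_t start_ns, uint64_t end_ns)
{
    t->total_ns += end_ns - start_ns;
    t->launches += 1;
}

/* Convert the accumulated nanoseconds to milliseconds. */
static double kernel_timer_ms(const KernelTimer *t)
{
    return (double)t->total_ns / 1.0e6;
}

/* Print one summary line for a kernel. */
static void kernel_timer_report(const char *name, const KernelTimer *t)
{
    printf("%s: %u launches, %.3f ms total\n",
           name, t->launches, kernel_timer_ms(t));
}
```

With one KernelTimer per kernel declared before the outer loop, each launch would call kernel_timer_add with the start and end values obtained from clGetEventProfilingInfo. The total for all kernels is then the sum of the individual totals, regardless of how often each kernel ran.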

However, you should consider a few things:

  • There is some boilerplate code involved. You may want to extract it into a convenience function, something like duration = processEvent(event0), that waits for the given event, computes its duration, releases the event and returns the computed duration.
  • Waiting for each event right after enqueueing forces the host to synchronize with the device after every kernel, which may affect the runtime behavior in certain cases.

And most importantly:

  • If you don't absolutely need this "programmatic access", you should consider simply running your code in a profiler. This should give you the desired information, namely the time that has been spent on each kernel in the command queue.

Upvotes: 3
