meowstronaut

Reputation: 69

Do I need more events when timing multiple work-items?

If I have more than one work-item to execute some kernel code, do I need to have more events to track the execution time for each work-item?

I am getting strange results: 1 work-item takes about 4 seconds to execute, and 100 work-items also take about 4 seconds. I can't see how this is possible, since my Nvidia GeForce GT 525M only has 2 compute units, each with 48 processing elements. This leads me to believe that the event I pass as an argument to clEnqueueNDRangeKernel tracks only one work-item. Is that true, and if so, how can I get it to track all the work-items?

This is what the Khronos user guide says about the event argument in clEnqueueNDRangeKernel:

event returns an event object that identifies this particular kernel execution instance

What is the meaning of "this particular kernel execution instance"? Isn't that a single work-item?

EDIT: Relevant host code:

static const size_t numberOfWorkItems = 48;
const size_t globalWorkSize[] = { numberOfWorkItems, 0, 0 };

cl_event events;
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, globalWorkSize, NULL, 0, NULL, &events);
// Blocking read of one cl_ulong result: the size is that of the data, not of the cl_mem handle.
ret = clEnqueueReadBuffer(command_queue, memobj, CL_TRUE, 0, sizeof(cl_ulong), val, 0, NULL, NULL);

clWaitForEvents(1, &events);
cl_ulong time_start;
cl_ulong time_end;

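// Note: QUEUED..END also includes the time the command waited in the queue;
// CL_PROFILING_COMMAND_START would give the time spent executing on the device only.
clGetEventProfilingInfo(events, CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &time_start, NULL);
clGetEventProfilingInfo(events, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &time_end, NULL);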

double nanoSeconds = (double) (time_end - time_start);
printf("OpenCL execution time is: %f milliseconds\n", nanoSeconds / 1000000.0);
printf("Result: %lu\n", val[0]);

Kernel code:

kernel void parallel_operation(__global ulong *val) {
    size_t i = get_global_id(0);
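    int n = 48;
    // One partial sum per work-item, shared within the work-group
    // (assumes a single work-group of at most 48 work-items).
    local unsigned int result[48];
    for (int z = 0; z < n; z++) {
        result[z] = 0;
    }
    // Every work-item must finish zeroing before any of them starts accumulating.
    barrier(CLK_LOCAL_MEM_FENCE);
    // here comes the long operation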
    for (ulong k = 0; k < 2000; k++) {
        for (ulong j = 0; j < 10000; j++) {
            result[i] += (j * 3) % 5;
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    if (i == 0) {
        for (int z = 1; z < n; z++) {
            result[0] += result[z];
        }
        *val = result[0];
    }
}

Upvotes: 1

Views: 53

Answers (1)

Conradin

Reputation: 180

You are measuring the execution time of your entire kernel, in other words the time from when the first work-item starts until the last work-item finishes. The event belongs to the whole clEnqueueNDRangeKernel command, not to an individual work-item. To my knowledge, OpenCL offers no way to measure the execution time of a single work-item.
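To make the "one event per command" point concrete, here is a minimal sketch of how the four profiling timestamps of a single event could be split into waiting time and execution time. The struct and function names (command_timestamps, wait_ms, execution_ms) are made up for illustration, and uint64_t stands in for cl_ulong so the snippet compiles without the OpenCL headers. In real host code the four values would come from clGetEventProfilingInfo with the parameter named in each comment, on a queue created with CL_QUEUE_PROFILING_ENABLE.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the timestamps (in nanoseconds) that
   clGetEventProfilingInfo reports for ONE enqueued command, i.e. for
   the whole NDRange, however many work-items it contains. */
typedef struct {
    uint64_t queued;  /* CL_PROFILING_COMMAND_QUEUED */
    uint64_t submit;  /* CL_PROFILING_COMMAND_SUBMIT */
    uint64_t start;   /* CL_PROFILING_COMMAND_START  */
    uint64_t end;     /* CL_PROFILING_COMMAND_END    */
} command_timestamps;

/* Convert a nanosecond count to milliseconds. */
double ns_to_ms(uint64_t ns) {
    return (double)ns / 1.0e6;
}

/* Time the command spent waiting before the device began executing it. */
double wait_ms(const command_timestamps *t) {
    assert(t->start >= t->queued);
    return ns_to_ms(t->start - t->queued);
}

/* Time the device spent executing the command, covering all of its
   work-items together. */
double execution_ms(const command_timestamps *t) {
    assert(t->end >= t->start);
    return ns_to_ms(t->end - t->start);
}
```

Subtracting QUEUED from END, as in the question, gives the sum of wait_ms and execution_ms. Either way the result describes the kernel launch as a whole, so adding more events would not give per-work-item numbers.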

Upvotes: 2
