Reputation: 793
I'm new to OpenCL and I'm trying to understand this example program written by Apple here.
The goal of the program is to calculate the square of each element of an input array and write the result to a new array.
You can see that the input array has 1024 elements. The global work size is 1024, and the size of each work group is the maximum reported by CL_KERNEL_WORK_GROUP_SIZE.
Can anybody explain to me what the point is of using so many work-items in each work group if there's no get_local_id() call in the kernel? Could they have used 1 as the size of each work group? What would the difference have been?
Thanks.
Some code to show the point:
// Get the maximum work group size for executing the kernel on the device
//
err = clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_WORK_GROUP_SIZE, sizeof(local), &local, NULL);
// Execute the kernel over the entire range of our 1d input data set
// using the maximum number of work group items for this device
//
global = count;
err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
Upvotes: 1
Views: 2692
Reputation: 333
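Your global work size is executed in chunks of the local work size (in theory). If you set the local work group size to 1, each work group would execute only 1 thread. On GPUs, work groups are mapped to compute units, so with a work group size of 1 your single thread may end up occupying a whole compute unit while the rest of that unit's execution lanes sit idle. This is really, really slow. The kernel doesn't have to call get_local_id() for this to matter: the local size controls how the work-items are scheduled on the hardware, whether or not the kernel ever looks at its local ID.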
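To put rough numbers on that, here is a minimal sketch in plain C (no OpenCL calls). The helper names `num_work_groups` and `lane_utilization` are made up for illustration, and the model behind `lane_utilization` is an assumption: it supposes the device runs the work-items of a group in fixed-width SIMD batches, which is typical of GPUs but varies by vendor and is not something the OpenCL API reports directly.

```c
#include <stddef.h>

/* Number of work groups formed for a 1D range.
 * OpenCL 1.x requires the global size to be a multiple of the
 * local size, so plain integer division is enough here. */
size_t num_work_groups(size_t global, size_t local)
{
    return global / local;
}

/* Fraction of SIMD lanes doing useful work within one work group.
 * ASSUMPTION (hypothetical model): the hardware executes a group's
 * work-items in batches of simd_width lanes, and lanes of a
 * partially filled batch are wasted. Real widths differ per device. */
double lane_utilization(size_t local, size_t simd_width)
{
    size_t batches = (local + simd_width - 1) / simd_width; /* round up */
    return (double)local / (double)(batches * simd_width);
}
```

With global = 1024 and an assumed SIMD width of 32, a local size of 256 gives 4 work groups with every lane busy, while a local size of 1 gives 1024 work groups that each use 1 lane out of 32 (about 3%) under this model.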
Upvotes: 2