Reputation: 9235
I'm writing a simple OpenCL application using a very basic kernel. I have only one workgroup, and I am attempting to vary the number of work items. I've noticed that when I use only the CPU, I can have any number of work items. However, when I use only the GPU, it seems I can only have 512, 1024, 2048, ... work items. 256 will generate errors, as will any number that isn't a power of two.
I've found this experimentally, but how can I find this information programmatically, presumably from the OpenCL C++ API?
Upvotes: 3
Views: 901
Reputation: 2844
The wavefront/warp size is a kernel parameter, not a device parameter (although it is in fact imposed by the hardware), so you can query it by calling clGetKernelWorkGroupInfo() with CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.
See the documentation at: http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clGetKernelWorkGroupInfo.html
NOTE: CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE was introduced in OpenCL 1.1.
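Once the multiple has been queried, a requested workgroup size can be snapped to it before launching the kernel. The following is only a minimal sketch: snap_to_preferred_multiple is a hypothetical helper name, not part of OpenCL, and it assumes the value returned for CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE is passed in as preferred_multiple.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not an OpenCL function): round `requested` down to a
// multiple of `preferred_multiple`, the value reported by
// clGetKernelWorkGroupInfo() for CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.
// Never returns less than one full multiple.
std::size_t snap_to_preferred_multiple(std::size_t requested,
                                       std::size_t preferred_multiple) {
    assert(preferred_multiple > 0);
    if (requested < preferred_multiple) {
        return preferred_multiple;
    }
    // Integer division discards the remainder, so this rounds down.
    return (requested / preferred_multiple) * preferred_multiple;
}
```

The result still has to be checked against the maximum workgroup size for the kernel and device before it is used.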
Upvotes: 1
Reputation: 2933
There is a device-wide limit on the workgroup size, and a separate, possibly lower limit for a given kernel that depends on its resource usage. You can query the maximum possible workgroup size for a device with clGetDeviceInfo() using CL_DEVICE_MAX_WORK_GROUP_SIZE. For a kernel, you can get the maximum workgroup size with clGetKernelWorkGroupInfo() using CL_KERNEL_WORK_GROUP_SIZE.
As for smaller sizes, on the GPU, the workgroup size must be a multiple of the wavefront/warp size. This is 32 on Nvidia, and 64 for most AMD GPUs (but 32 for some). You can query the warp size using Nvidia's cl_nv_device_attribute_query (which provides the option CL_DEVICE_WARP_SIZE_NV for clGetDeviceInfo()), but there's no good way to get it on AMD. I just assume it is 64.
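Putting the two limits and the warp-size constraint together, one way to pick a workgroup size might look like the sketch below. The function name largest_local_size is hypothetical; device_max and kernel_max are assumed to be the values already obtained for CL_DEVICE_MAX_WORK_GROUP_SIZE and CL_KERNEL_WORK_GROUP_SIZE, and warp_size is whatever you queried or assumed (for example 32 or 64).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not an OpenCL function): largest workgroup size that
// respects both the device limit and the per-kernel limit and is a multiple
// of the warp/wavefront size. Returns 0 if no such size exists.
std::size_t largest_local_size(std::size_t device_max,
                               std::size_t kernel_max,
                               std::size_t warp_size) {
    assert(warp_size > 0);
    // The tighter of the two limits is the one that applies.
    const std::size_t limit = std::min(device_max, kernel_max);
    // Round the limit down to a multiple of the warp size.
    return (limit / warp_size) * warp_size;
}
```

A return value of 0 means the limits are smaller than one warp, in which case the multiple-of-warp rule cannot be satisfied and the size has to be chosen some other way.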
Additionally, the global work size must be divisible by the workgroup size in each dimension. It's usually best to round your global size up to the next multiple of the workgroup size, and then have the kernel skip the out-of-bounds work items.
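The rounding step is simple integer arithmetic. A minimal sketch, where round_up_global_size is a hypothetical helper name, n is the number of elements you actually want to process, and local_size is the workgroup size for that dimension:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not an OpenCL function): smallest multiple of
// `local_size` that is greater than or equal to `n`.
std::size_t round_up_global_size(std::size_t n, std::size_t local_size) {
    assert(local_size > 0);
    // Adding local_size - 1 before dividing turns floor division into ceiling.
    return ((n + local_size - 1) / local_size) * local_size;
}
```

For example, 1000 elements with a workgroup size of 64 gives a global size of 1024. Inside the kernel, a guard such as `if (get_global_id(0) >= n) return;` (with n passed in as a kernel argument) keeps the 24 padded work items from touching memory past the end of the buffers.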
Upvotes: 4