Reputation: 329
I've shortly started using OpenCL to write programs for GPUs. I'm familiar with basic concepts that are required to write efficient programs in OpenCL, like work-items, work-groups, global-item-size, barriers, etc.
One of my programs involved making about 20 million work-groups with 360 work-items in each work-group. However, for some reason OpenCL couldn't handle that many number of work-groups. All elements of my output array simply remained 0. In addition, OpenCL didn't even start the calculations when I called clEnqueueNDRangeKernel(), since when I viewed the GPU usage stats I didn't see a "spike" that usually happens when I run an OpenCL kernel. I attempted to reduce the number work-groups, to see what is the maximum number of work-groups. It was 5965232 and it is always 5965232. Not more, not less.
I know that the problem is NOT with the number of work-items. It is with the number of work-groups. To prove this, here is my original code, where LIST_SIZE is 360.
global_item_size = 5965232*LIST_SIZE;
local_size = LIST_SIZE;
and a modified version of my code:
global_item_size = 5965232*LIST_SIZE*1.3;
local_size = LIST_SIZE*1.3;
In all the scenarios, the number of work-groups limit was 5965232.
I'm trying to find out what causes this limit and how to check this limit. I understand that there may be a limitation, but what causes this limitation and how can I check check this limit number in OpenCL? I've did a lot of research, but all sites are talking about work-group size limits and not about number of work-group limits.
I'm using the Intel Graphics HD 4000 GPU with an i5-3320M. It has 32 MB of integrated RAM.
Upvotes: 1
Views: 153
Reputation: 5746
5965232*320 = 2147483520 < 2147483647 = 2^31-1 = maximum 32-bit signed integer value
You are dealing with a classical 32-bit integer overflow in the multiplication in line
global_item_size = 5965232*LIST_SIZE;
Try global_item_size = 5965232ull*(uint64_t)LIST_SIZE;
instead. Make sure global_item_size
is data type uint64_t
.
Upvotes: 2