Reputation: 916
So I am pretty new to OpenCL, and I am trying to better understand work-groups and work-items. I understand that all of the threads (work-items) inside a single group share local memory and can synchronize with one another through barriers and atomic operations.
However, what if I don't need any of those benefits and only care about the global ID of any given thread, i.e. `get_global_id(0)`?
How then should I go about choosing how many groups to use and how many work-items each group should have, if all I care about is the total number of threads (= groups × items per group)?
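In other words, each thread only does something like this (the computation and parameter names here are just placeholders, not my real matrix code):

```c
__kernel void scale_elements(__global const float *in,
                             __global float *out,
                             const unsigned int n)
{
    size_t gid = get_global_id(0);   /* unique index across the whole NDRange */
    if (gid < n)                     /* guard in case the global size gets rounded up */
        out[gid] = 2.0f * in[gid];
}
```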
For example, let's say I have a program that computes a 400x400 matrix, so 160,000 threads in total. Originally I thought (naively) I'd just stick them all in one group, but that was way above the allowed limit of work-items per group. So I chose an arbitrary split: 1,600 groups of 100 threads each. My average speedup was about 5.5x over the single-threaded CPU version (I don't have a nice GPU to run my code on yet...). Then I figured, since I have no use for groups anyway, why not give every single thread its own group? That only averaged about 4.5x, so it was actually slower to give each thread its own group.
What exactly is happening here? I presume that creating groups has some extra overhead. How do I go about calculating the optimal number of groups, and is the optimal solution simply to make as few groups as possible?
Upvotes: 1
Views: 109
Reputation: 3625
One option is to pass `NULL` as the `local_work_size` parameter of `clEnqueueNDRangeKernel`; in that case the OpenCL implementation will decide the local size on its own. This might not give the optimal result, but at least the implementation will try to guess a good local size for you.
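A minimal sketch of the enqueue call, assuming `queue` and `kernel` have already been created (error handling and buffer setup omitted):

```c
size_t global_size = 160000;  /* e.g. 400 * 400 work-items in total */

/* Passing NULL for local_work_size leaves the work-group size
 * entirely up to the OpenCL implementation. */
cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                    1,             /* work_dim */
                                    NULL,          /* global_work_offset */
                                    &global_size,  /* global_work_size */
                                    NULL,          /* local_work_size: let the implementation choose */
                                    0, NULL, NULL);
```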
In addition, `clGetKernelWorkGroupInfo` can be used to query `CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE`.
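For example (again just a sketch; `kernel` and `device` are assumed to exist already):

```c
size_t preferred_multiple = 0;

/* Ask the implementation what multiple the work-group size should
 * ideally be for this kernel on this device. Choosing the local size
 * as a multiple of this value generally keeps the compute units busy. */
clGetKernelWorkGroupInfo(kernel, device,
                         CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                         sizeof(preferred_multiple),
                         &preferred_multiple, NULL);
```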
Upvotes: 2