Reputation: 323
I have started working with OpenCL and have some basic knowledge of how work-groups and kernels work. Suppose I have a vector of size 1024, and the WorkGroupSize of my GPU is 256. So my VectorSize is a multiple of my WorkGroupSize, and this works pretty well as an example. But in real-world scenarios, the VectorSize is not always evenly divisible by the WorkGroupSize. So how do I deal with such problems? Is there any way to pass null values to make the VectorSize evenly divisible by the WorkGroupSize?
Upvotes: 0
Views: 1142
Reputation: 14825
You do not need to fill the work-group yourself: enqueueing a kernel with fewer work-items per work-group than the maximum is fine.
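So for example, if you have 1100 items, you could work in groups of [256, 256, 256, 256, 76], and this will run as fast as 5 groups of 256 (1280 items). Note that a non-uniform final group like this requires OpenCL 2.0 or later; in OpenCL 1.x the global size must be a multiple of the local size.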
Obviously, if you run 6 smaller groups [200, 200, 200, 200, 200, 100], it will be slower.
Upvotes: 1
Reputation: 72348
It is absolutely possible to pad input buffers to be round multiples of the workgroup size you select for your kernel. However, it often isn't practical, because you need an algorithm that can naturally handle uninitialized or extra invalid data without error.
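As a rough host-side illustration of the padding approach, here is a minimal C sketch. The helper name `pad_to_multiple` and the `fill` parameter are hypothetical (not part of any OpenCL API), and it assumes the algorithm has a harmless filler value, such as 0 for a sum.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copies n floats from src into a newly allocated
 * buffer whose length is n rounded up to a multiple of local_size
 * (local_size must be > 0). The extra tail elements are set to `fill`,
 * which is assumed to be a value the algorithm can safely ignore.
 * Returns NULL on allocation failure; the caller frees the result. */
float *pad_to_multiple(const float *src, size_t n, size_t local_size,
                       float fill, size_t *padded_n)
{
    size_t total = ((n + local_size - 1) / local_size) * local_size;
    float *dst = malloc(total * sizeof *dst);
    if (dst == NULL) {
        return NULL;
    }
    memcpy(dst, src, n * sizeof *dst);   /* real data */
    for (size_t i = n; i < total; ++i) { /* padding   */
        dst[i] = fill;
    }
    *padded_n = total;
    return dst;
}
```

This only helps when such a neutral filler value exists, which is exactly the practicality limit described above.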
A far simpler solution is just to pass the input buffer length as an argument and then enclose the calculation code in an if statement based on the thread index, something like:
__kernel void my_kernel(....., unsigned int N)
{
    unsigned int tid = get_global_id(0);
    if (tid < N) {
        /* kernel buffer access goes here */
    }
}
This doesn't cause significant performance penalties because the conditional evaluates uniformly within every workgroup except the last one. You then round the number of workgroups you launch up to the next whole number, ceil(N / workgroup size), so that the whole input buffer is processed.
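The rounding step on the host side could look something like the following C sketch. The function name `round_up_global_size` is made up for illustration; only the arithmetic matters.

```c
#include <stddef.h>

/* Hypothetical helper: returns the smallest multiple of local_size that
 * is >= n, i.e. ceil(n / local_size) * local_size.
 * local_size must be greater than zero. */
size_t round_up_global_size(size_t n, size_t local_size)
{
    size_t groups = (n + local_size - 1) / local_size; /* workgroups to launch */
    return groups * local_size;                        /* total work-items     */
}
```

For 1100 elements and a workgroup size of 256 this gives 5 workgroups, or 1280 work-items. The result would be passed as `global_work_size` to `clEnqueueNDRangeKernel` with 256 as `local_work_size`, while the true length 1100 goes in as the kernel's `N` argument so the extra 180 work-items skip the buffer access.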
Upvotes: 2