Reputation: 558
I'm calling the kernel below with GlobalWorkSize 64 4 1 and WorkGroupSize 1 4 1, with the argument output initialized to zeros.
__kernel void kernelB(__global unsigned int * output)
{
    uint gid0 = get_global_id(0);
    uint gid1 = get_global_id(1);
    output[gid0] += gid1;
}
I'm expecting 6 6 6 6 ... as the sum of the gid1's (0 + 1 + 2 + 3). Instead I get 3 3 3 3 ...
Is there a way to get this functionality? In general I need the sum of the results of each work-item in a work group.
EDIT: It seems it must be said, I'd like to solve this problem without atomics.
Upvotes: 0
Views: 463
Reputation: 5087
You need to use local memory to store the output from all work items in the group. After the work items have finished their computation, you sum the results with an accumulation step.
__kernel void kernelB(__global unsigned int * output)
{
    uint item_id = get_local_id(0);
    uint group_id = get_group_id(0);
    //memory size is hard-coded to the expected work group size for this example
    local unsigned int result[4];
    //the computation
    result[item_id] = item_id % 3;
    //wait for all items to write to result
    barrier(CLK_LOCAL_MEM_FENCE);
    //simple O(n) reduction using the first work item in the group
    if(item_id == 0){
        for(int i=1;i<4;i++){
            result[0] += result[i];
        }
        output[group_id] = result[0];
    }
}
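The kernel above indexes along dimension 0 and assumes a one-dimensional work group of four items. For the launch configuration in the question (GlobalWorkSize 64 4 1, WorkGroupSize 1 4 1), the work items of a group differ only in dimension 1, so the same idea might look like the sketch below. The names kernelB_local, GROUP_SIZE and partial are made up for illustration, and the sketch assumes there is exactly one work group along dimension 1, so that only one work item ever writes a given output[gid0].

```c
// Hypothetical constant: must match the local size in dimension 1 (4 in the question).
#define GROUP_SIZE 4

__kernel void kernelB_local(__global unsigned int *output)
{
    uint gid0 = get_global_id(0);
    uint gid1 = get_global_id(1);
    uint lid1 = get_local_id(1);   // 0..GROUP_SIZE-1 within the work group

    // one slot per work item in the group
    __local unsigned int partial[GROUP_SIZE];

    // each work item stores its own contribution (here simply gid1)
    partial[lid1] = gid1;

    // wait until every item in the group has written its slot
    barrier(CLK_LOCAL_MEM_FENCE);

    // the first item in the group adds up the slots and writes the result
    if (lid1 == 0) {
        unsigned int sum = 0;
        for (uint i = 0; i < GROUP_SIZE; i++) {
            sum += partial[i];
        }
        output[gid0] = sum;
    }
}
```

With the launch sizes from the question, each group sees gid1 values 0, 1, 2 and 3, so every element of output should end up as 6. No atomics are involved, because each output element is written by a single work item.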
Upvotes: 2
Reputation: 6343
Multiple work items are accessing the same elements of global memory simultaneously, so the result is undefined. You need to use atomic operations or have each work item write to a unique location.
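Since the question rules out atomics, here is a minimal sketch of the "unique location per work item" option, split into two passes. The kernel names write_partials and sum_partials, the partial buffer and the rows argument are all assumptions for illustration. The partial buffer is assumed to hold one unsigned int per work item of the first pass (64 * 4 for the sizes in the question).

```c
// Pass 1: launched with the 2D global size from the question (64 x 4).
// Every work item writes to its own slot, so no two items share an address.
__kernel void write_partials(__global unsigned int *partial)
{
    uint gid0 = get_global_id(0);
    uint gid1 = get_global_id(1);
    uint width = get_global_size(0);

    partial[gid1 * width + gid0] = gid1;
}

// Pass 2: launched with a 1D global size equal to the width of pass 1 (64).
// rows is the global size of pass 1 in dimension 1 (4 in the question).
__kernel void sum_partials(__global const unsigned int *partial,
                           __global unsigned int *output,
                           uint rows)
{
    uint gid0 = get_global_id(0);
    uint width = get_global_size(0);

    // each work item sums one column of the partial results
    unsigned int sum = 0;
    for (uint r = 0; r < rows; r++) {
        sum += partial[r * width + gid0];
    }
    output[gid0] = sum;
}
```

This costs an extra buffer and a second kernel launch, but it does not depend on the work group shape, which may matter if the values to be summed are spread across several work groups.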
Upvotes: 2