Reputation: 508
I want to run an instrumented OpenCL kernel to get some execution metrics. More specifically, I have added a hidden global buffer which will be initialized from the host code with N zeros. Each of the N values is an integer representing a different metric, which each kernel instance will increment in a different manner, depending on its execution path.
A simplistic example:
__kernel void test(__global int *a, __global int *hiddenCounter) {
    if (get_global_id(0) == 0) {
        // do stuff and then increment the appropriate counter (random numbers here)
        hiddenCounter[0] += 3;
    }
    else {
        // do stuff...
        hiddenCounter[1] += 5;
    }
}
After the kernel execution is complete, I need the host code to aggregate (a simple element-wise vector addition) all the hiddenCounter buffers and print the appropriate results.

My question is whether there are race conditions when multiple kernel instances try to write to the same index of the hiddenCounter buffer (which will definitely happen in my project). Do I need to enforce some kind of synchronization? Or is this impossible with __global arguments, meaning I need to change them to __private? Will I be able to aggregate __private buffers from the host code afterwards?
Upvotes: 1
Views: 462
Reputation: 23428
My question is whether there are race conditions when multiple kernel instances try to write to the same index of the hiddenCounter buffer

The answer to this is emphatically yes: your code will be vulnerable to race conditions as currently written.
Do I need to enforce some kind of synchronization?
Yes, you can use global atomics for this purpose. All but the most ancient GPUs will support this (anything supporting OpenCL 1.2, or the cl_khr_global_int32_base_atomics and similar extensions).
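For example, a minimal sketch of the question's kernel rewritten with global atomics (same placeholder increments as in the question):

__kernel void test(__global int *a, __global int *hiddenCounter) {
    if (get_global_id(0) == 0) {
        // do stuff, then increment the counter atomically so that
        // concurrent updates from other work items cannot be lost
        atomic_add(&hiddenCounter[0], 3);
    }
    else {
        // do stuff...
        atomic_add(&hiddenCounter[1], 5);
    }
}

(On very old devices that only have the extension rather than core atomics, the function is spelled atom_add instead.)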
Note that this will have a non-trivial performance overhead. Depending on your access patterns and frequency, collecting intermediate results in private or local memory and writing them out to global memory at the end of the kernel may be faster. (In the local case, the whole work group would share just one global atomic call for each updated cell; you'll need to use local atomics or a reduction algorithm to accumulate the values from individual work items across the group, though.)
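A sketch of the local-memory variant, assuming the two counters from the example (NUM_COUNTERS is a name introduced here for illustration):

#define NUM_COUNTERS 2

__kernel void test(__global int *a, __global int *hiddenCounter) {
    __local int localCounter[NUM_COUNTERS];

    // one work item zeroes the group's shared counters
    if (get_local_id(0) == 0) {
        for (int i = 0; i < NUM_COUNTERS; ++i)
            localCounter[i] = 0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    if (get_global_id(0) == 0) {
        // do stuff, then bump the group-local counter with a local atomic
        atomic_add(&localCounter[0], 3);
    }
    else {
        // do stuff...
        atomic_add(&localCounter[1], 5);
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // one global atomic per counter for the entire work group
    if (get_local_id(0) == 0) {
        for (int i = 0; i < NUM_COUNTERS; ++i)
            atomic_add(&hiddenCounter[i], localCounter[i]);
    }
}

The two barriers keep the zeroing, the per-item updates, and the final flush to global memory from overlapping.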
Another option is to use a much larger global memory buffer, with counters for each work item or group. In that case you will not need atomics to write to them, but you will subsequently need to combine the values on the host. This uses much more memory, obviously, and likely more memory bandwidth too; modern GPUs should cache accesses to your small hiddenCounter buffer. So you'll need to work out, or simply try, which is the lesser evil in your case.
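A sketch of the per-work-item variant, where allCounters is assumed to hold global_size * NUM_COUNTERS integers (both names introduced here for illustration), so each work item owns its own slice and no atomics are needed:

#define NUM_COUNTERS 2

__kernel void test(__global int *a, __global int *allCounters) {
    // each work item writes only to its own NUM_COUNTERS-wide slice
    __global int *myCounters = allCounters + get_global_id(0) * NUM_COUNTERS;

    if (get_global_id(0) == 0) {
        // do stuff...
        myCounters[0] += 3;
    }
    else {
        // do stuff...
        myCounters[1] += 5;
    }
}

On the host, after reading the buffer back (e.g. with clEnqueueReadBuffer), the element-wise aggregation you describe is a plain loop (hostCounters, globalSize and totals are hypothetical names):

/* hostCounters holds globalSize * NUM_COUNTERS ints read back from the device */
for (size_t i = 0; i < globalSize; ++i)
    for (int c = 0; c < NUM_COUNTERS; ++c)
        totals[c] += hostCounters[i * NUM_COUNTERS + c];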
Upvotes: 1