Reputation: 4459
In CUDA, atomicAdd for double can be implemented using a while loop and the atomicCAS operation. But how could I implement an atomic add for type int3 efficiently?
Upvotes: 3
Views: 354
Reputation: 151799
After further consideration, I'm not sure how an atomicAdd on an int3 would be any different from 3 separate atomicAdd operations, each on an int location. Why not do that?
(An int3 cannot be loaded as a single quantity anyway in CUDA at the machine level. The compiler is guaranteed to split that into multiple loads, so although there would be a hazard to asynchronously read the int3, that hazard would be there anyway, with or without atomics.)
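The three-separate-atomics idea can be sketched as follows. This is a minimal illustration, not code from the original answer: the names atomicAddInt3, accumulate and runAccumulate are hypothetical, and the sketch assumes that no thread needs to read the int3 as a consistent whole while updates are still in flight.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: adds val to *addr one component at a time.
// Each component update is atomic, but the int3 as a whole is not.
__device__ void atomicAddInt3(int3 *addr, int3 val)
{
    atomicAdd(&addr->x, val.x);
    atomicAdd(&addr->y, val.y);
    atomicAdd(&addr->z, val.z);
}

// Every thread adds the same illustrative update (1, 2, 3).
__global__ void accumulate(int3 *sum)
{
    atomicAddInt3(sum, make_int3(1, 2, 3));
}

// Launches 64 * 256 threads and returns the accumulated value.
int3 runAccumulate()
{
    int3 *d_sum;
    cudaMalloc(&d_sum, sizeof(int3));
    cudaMemset(d_sum, 0, sizeof(int3));

    accumulate<<<64, 256>>>(d_sum);

    int3 h_sum;
    // cudaMemcpy waits for the kernel on the default stream to finish.
    cudaMemcpy(&h_sum, d_sum, sizeof(int3), cudaMemcpyDeviceToHost);
    cudaFree(d_sum);
    return h_sum;
}

int main()
{
    int3 s = runAccumulate();
    printf("%d %d %d\n", s.x, s.y, s.z);
    return 0;
}
```

Once the kernel has completed, each component holds the thread count multiplied by its per-thread increment, which is the same result a hypothetical 96-bit atomic add would have produced.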
But to answer the specific question you asked, it's not possible with a single atomic operation. int3 is a 96-bit type, and CUDA atomics support operations up to 64 bits only. Here is an atomic add example for float2 (a 64-bit type) and you could do something similar for up to e.g. short3 or short4.
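For a type that fits in 64 bits, the usual pattern is a compare-and-swap loop on the underlying 64-bit word. The sketch below shows that pattern for float2; it is not the example the answer refers to, and the names atomicAddFloat2, accumulate2 and runAccumulate2 are hypothetical. It relies on float2 being 8 bytes in size and 8-byte aligned, so that it can be viewed as an unsigned long long.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: atomically adds val to *addr by treating the
// 8-byte float2 as one 64-bit word and retrying with atomicCAS.
__device__ void atomicAddFloat2(float2 *addr, float2 val)
{
    unsigned long long *p = reinterpret_cast<unsigned long long *>(addr);
    unsigned long long old = *p;
    unsigned long long assumed;
    do {
        assumed = old;

        // Unpack the 64-bit word, add, and repack.
        float2 cur;
        memcpy(&cur, &assumed, sizeof(cur));
        cur.x += val.x;
        cur.y += val.y;
        unsigned long long desired;
        memcpy(&desired, &cur, sizeof(desired));

        // Succeeds only if nobody changed the word in the meantime.
        old = atomicCAS(p, assumed, desired);
    } while (assumed != old);  // integer compare, so NaN cannot stall the loop
}

// Every thread adds the same illustrative update (1.0, 0.5).
__global__ void accumulate2(float2 *sum)
{
    atomicAddFloat2(sum, make_float2(1.0f, 0.5f));
}

// Launches 64 * 256 threads and returns the accumulated value.
float2 runAccumulate2()
{
    float2 *d_sum;
    cudaMalloc(&d_sum, sizeof(float2));
    cudaMemset(d_sum, 0, sizeof(float2));  // all-zero bits are 0.0f

    accumulate2<<<64, 256>>>(d_sum);

    float2 h_sum;
    cudaMemcpy(&h_sum, d_sum, sizeof(float2), cudaMemcpyDeviceToHost);
    cudaFree(d_sum);
    return h_sum;
}

int main()
{
    float2 s = runAccumulate2();
    printf("%f %f\n", s.x, s.y);
    return 0;
}
```

The same loop would work for short4, since it is also a 64-bit, 8-byte-aligned type. It cannot be stretched to int3, because there is no 96-bit or 128-bit atomicCAS to build it on.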
You could alternatively use a reduction method or else a critical section. There are plenty of questions here on the SO cuda tag that discuss reductions and critical sections.
A reduction method could be implemented as follows:
Each thread that wants to make an atomic update to a particular int3 location uses this method to create a queue or list of the atomic update quantities.
Once the list generation is complete, launch a kernel to do a parallel reduction on the list, so as to produce the final reduced quantity that belongs in that location.
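The two steps above might look roughly like the following. This is a sketch under stated assumptions rather than the method the answer links to: every thread contributes exactly one update, the list is sized for all of them, slots are claimed with a 32-bit atomicAdd on a counter, and a single 256-thread block performs the reduction. All names (enqueue, reduceList, runReduction, BLOCK) are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 256  // threads per block; must be a power of two

// Step 1: each thread claims a slot in the list and stores its update.
__global__ void enqueue(int3 *list, unsigned int *count, unsigned int capacity)
{
    int3 update = make_int3(1, 2, 3);          // illustrative update value
    unsigned int slot = atomicAdd(count, 1u);  // 32-bit atomic on the counter
    if (slot < capacity)
        list[slot] = update;
}

// Step 2: one block sums the list and adds the total to *target.
__global__ void reduceList(const int3 *list, const unsigned int *count,
                           unsigned int capacity, int3 *target)
{
    __shared__ int3 partial[BLOCK];
    unsigned int n = min(*count, capacity);

    // Each thread sums a strided slice of the list.
    int3 acc = make_int3(0, 0, 0);
    for (unsigned int i = threadIdx.x; i < n; i += blockDim.x) {
        acc.x += list[i].x;
        acc.y += list[i].y;
        acc.z += list[i].z;
    }
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction in shared memory.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            partial[threadIdx.x].x += partial[threadIdx.x + s].x;
            partial[threadIdx.x].y += partial[threadIdx.x + s].y;
            partial[threadIdx.x].z += partial[threadIdx.x + s].z;
        }
        __syncthreads();
    }

    // A single thread writes the result, so no atomics are needed here.
    if (threadIdx.x == 0) {
        target->x += partial[0].x;
        target->y += partial[0].y;
        target->z += partial[0].z;
    }
}

// Runs both kernels and returns the final value of the int3 location.
int3 runReduction()
{
    const unsigned int capacity = 64 * BLOCK;
    int3 *d_list, *d_target;
    unsigned int *d_count;
    cudaMalloc(&d_list, capacity * sizeof(int3));
    cudaMalloc(&d_target, sizeof(int3));
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemset(d_target, 0, sizeof(int3));
    cudaMemset(d_count, 0, sizeof(unsigned int));

    enqueue<<<64, BLOCK>>>(d_list, d_count, capacity);
    reduceList<<<1, BLOCK>>>(d_list, d_count, capacity, d_target);

    int3 h_target;
    cudaMemcpy(&h_target, d_target, sizeof(int3), cudaMemcpyDeviceToHost);
    cudaFree(d_list);
    cudaFree(d_target);
    cudaFree(d_count);
    return h_target;
}

int main()
{
    int3 t = runReduction();
    printf("%d %d %d\n", t.x, t.y, t.z);
    return 0;
}
```

Because the two kernels are launched on the same stream, the reduction starts only after every update has been enqueued, which is what makes the final non-atomic write safe. For a very long list, a multi-block reduction would likely be a better fit than the single block used here.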
Upvotes: 2