For example, I have an array Y and I want to increment Y[0] from multiple threads.
If I just do Y[0]++ in a __global__ function, Y[0] ends up as 1 instead of the number of threads.
So, how do I resolve this?
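Use an atomic operation. Which atomics are available depends on the hardware: atomicAdd on an int in global memory needs compute capability 1.1 or higher, and the code has to be compiled for a matching architecture. If this compiles with no warnings, it is likely to work, but it should be tested :-), or at least examine the generated assembler.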
__global__ void mykernel(int *value){
    // Atomically add 1 to *value; the return value is what *value held before the add.
    int my_old_val = atomicAdd(value, 1);
}
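To test it, here is a minimal sketch of a complete program that launches a kernel like the one above and checks the result on the host. The kernel name `count_threads`, the launch size (64 blocks of 256 threads) and the variable names are my own choices for illustration, not anything from the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every thread adds 1 to the same counter. atomicAdd makes the
// read-modify-write indivisible, so no increment is lost.
__global__ void count_threads(int *counter)
{
    atomicAdd(counter, 1);
}

int main()
{
    const int blocks = 64, threads = 256;   // illustrative launch size
    int *d_counter = nullptr;
    int h_counter = 0;

    if (cudaMalloc(&d_counter, sizeof(int)) != cudaSuccess) return 1;
    cudaMemset(d_counter, 0, sizeof(int));  // start counting from zero

    count_threads<<<blocks, threads>>>(d_counter);

    // A blocking cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(&h_counter, d_counter, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("counter = %d, expected %d\n", h_counter, blocks * threads);
    return h_counter == blocks * threads ? 0 : 1;
}
```

If you swap the atomicAdd for a plain `(*counter)++`, the printed counter should come out far below the expected value, which is the lost-update problem from the question.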