Reputation: 619
I'm developing a CUDA kernel to compute the histogram of an image (NVIDIA GTX 480). Using the CUDA profiler, I've noticed that 82.2% branch divergence was reported. The profiler points to the following function as the source of the divergence, located in a file named device_functions.h (in particular the line containing the return statement):
static __forceinline__
unsigned int __uAtomicAdd(unsigned int *p, unsigned int val)
{
    return __nvvm_atom_add_gen_i((volatile int *)p, (int)val);
}
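For reference, my kernel is along these lines (a simplified sketch, not my exact code; names are illustrative), with a per-block histogram kept in shared memory and folded into the global result at the end:
__global__ void histogram256(const unsigned char *img, int n, unsigned int *hist)
{
    __shared__ unsigned int smem[256];           // per-block histogram bins
    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        smem[b] = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (; i < n; i += stride)
        atomicAdd(&smem[img[i]], 1u);            // shared-memory atomic add; this is where the profiler points
    __syncthreads();

    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        atomicAdd(&hist[b], smem[b]);            // fold block results into the global histogram
}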
Is it correct to say that atomic operations cause branch divergence?
Upvotes: 2
Views: 410
Reputation: 152113
To some degree, the atomic implementation in CUDA may vary by GPU architecture. But specifically for the GTX 480 (a Fermi-class GPU), __shared__ memory atomics are implemented not as a single machine instruction, but rather as a sequence of machine (SASS) instructions that form a loop.
This loop is essentially contending for a lock. Once a particular thread acquires the lock, it completes the requested memory operation atomically on the identified shared memory cell and then releases the lock.
The process of looping to acquire the lock necessarily involves branch divergence. The possibility for branch divergence in this case is not evident from the C/C++ source code, but will be evident if you inspect the SASS code.
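To make the pattern concrete, here is a rough source-level sketch of the kind of loop involved. This is purely an illustration using a hypothetical per-bin lock (the real implementation is emitted directly as SASS by the compiler, not as source code you can write), but the control flow is similar in spirit: each thread loops until it wins the lock, so threads in the same warp leave the loop on different iterations, which is branch divergence.
__device__ void emulated_shared_atomic_add(unsigned int *bin, int *lock)
{
    // Illustration only: bin and lock are assumed to live in shared memory.
    bool done = false;
    while (!done) {                        // divergent loop: threads finish on different iterations
        if (atomicCAS(lock, 0, 1) == 0) {  // try to acquire the per-bin lock
            *bin += 1;                     // the protected update
            __threadfence_block();         // make the update visible before releasing
            atomicExch(lock, 0);           // release the lock
            done = true;
        }
    }
}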
Global atomics are generally implemented as a single (ATOM or RED) SASS instruction. However, global atomics may still involve serialization of access if executed by multiple threads in the same warp. I wouldn't normally think of this as a case of "divergence", but I'm not entirely sure how the profiler would report it. If you ran an experiment that involved only global atomics, I think it would become clear.
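For example, a minimal kernel along these lines (names are mine, for illustration only) uses only global atomics; profiling it in isolation would show how the tool reports that case:
__global__ void hist_global_atomics(const unsigned char *img, int n, unsigned int *hist)
{
    // hist is assumed to hold 256 bins, zeroed before launch
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (; i < n; i += stride)
        atomicAdd(&hist[img[i]], 1u);    // global atomic: a single SASS ATOM/RED per update
}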
It's possible that the reported divergence in your case is entirely due to the shared memory atomic behavior described above, which is expected.
Upvotes: 5