Reputation: 93
I have read the Parallel Thread Execution ISA section Parallel Synchronization and Communication Instructions: bar, barrier, which details PTX synchronization instructions.
It says there are 16 "barrier logical resources", and you can specify which barrier to use with the parameter a
. What is a barrier logical resource?
I have a piece of code from an outside source, which I know works. However, I cannot understand the syntax used inside asm
, or what "memory"
does. I assume name
replaces %0
and numThreads
replaces %1
, but what is "memory"
, and what are the colons doing?
__device__ __forceinline__ void namedBarrierSync(int name, int numThreads) {
    asm volatile("bar.sync %0, %1;" : : "r"(name), "r"(numThreads) : "memory");
}
In a block of 256 threads, I want only threads 64 ~ 127 to synchronize. Is this possible with the barrier.sync
function?
For example, say I have a grid of 1 block, with a block of 256 threads. We split the block into 3 conditional branches such that threads 0 ~ 63 go into kernel 1, threads 64 ~ 127 go into kernel 2, and threads 128 ~ 255 go into kernel 3. I want threads in kernel 2 to synchronize only among themselves. So if I use the namedBarrierSync
function defined above, calling namedBarrierSync(1, 64)
: does it synchronize only threads 64 ~ 127, or threads 0 ~ 63?
I have tested with the code below (assume that gpuAssert()
is an error-checking function defined elsewhere in the file).
Here is the code:
__global__ void test(int num_threads)
{
    if (threadIdx.x >= 64 && threadIdx.x < 128)
    {
        namedBarrierSync(0, num_threads);
    }
    __syncthreads();
}

int main(void)
{
    test<<<1, 1, 256>>>(128);
    gpuAssert(cudaDeviceSynchronize(), __FILE__, __LINE__);
    printf("complete\n");
    return 1;
}
Upvotes: 2
Views: 3753
Reputation: 3095
There is another bug in test()
that wasn't mentioned by @tera or in the cross-post:
__syncthreads()
implicitly uses the same barrier resource/name 0
(see Compiler Explorer), which means that threads that don't execute the conditional block will still try to participate in the same barrier, but with a different number of participating threads, which causes undefined behavior. In practice the kernel seems to complete successfully on older architectures like Pascal (even though the desired synchronization might not have been achieved), while on newer architectures like Turing and Ada Lovelace I get an
an illegal instruction was encountered
runtime error at the next cudaDeviceSynchronize()
. On either architecture compute-sanitizer --tool synccheck
will report
Barrier error detected. Divergent thread(s) in block.
This can be fixed by using a different barrier resource/name in the conditional block, e.g. namedBarrierSync(1, num_threads);
.
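Applied to the kernel from the question, a minimal sketch of that fix (keeping the question's namedBarrierSync helper) might look like this; note that the count passed to bar.sync should match the number of threads that actually reach the barrier (64 for this branch):

```cuda
__device__ __forceinline__ void namedBarrierSync(int name, int numThreads) {
    // bar.sync takes an explicit barrier name (0-15) and a thread count
    asm volatile("bar.sync %0, %1;" : : "r"(name), "r"(numThreads) : "memory");
}

__global__ void test(int num_threads)
{
    if (threadIdx.x >= 64 && threadIdx.x < 128)
    {
        // Barrier 1 avoids colliding with __syncthreads(), which uses barrier 0.
        // num_threads must equal the number of threads reaching this point (64 here).
        namedBarrierSync(1, num_threads);
    }
    __syncthreads();  // all 256 threads of the block participate in barrier 0
}
```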
Upvotes: 1
Reputation: 7265
barrier.sync
with a named barrier and thread count of 64 synchronizes the first two warps arriving at the named barrier (for compute capability up to 6.x) or the first 64 threads arriving at the named barrier (for compute capability 7.0 onwards).
Also note that the launch test<<<1, 1, 256>>>(128) starts only a single thread per block, because the third launch parameter is the dynamic shared memory size in bytes, not a thread count. Use test<<<1, 256>>>(128);
instead.
Upvotes: 5