roger1994

Reputation: 149

Why doesn't CUDA synchronization point prevent race condition?

We ran cuda-memcheck --tool racecheck <executable> on our code and got the following memory-hazard errors.

========= Race reported between Read access at 0x00004098 CUDA.cu:123:KernelFunction()
=========     and Write access at 0x00005058 in CUDA.cu:146:KernelFunction() [529996 hazards]  

Here's the code. The tool claims that line 123, value = sharedMemory0[sharedMemoryIndex];, is in a race condition with line 146, sharedMemory0[sharedIndex0] = sharedMemory1[sharedIndex1];. We have

// Synchronization Point 1 
__syncthreads(); 
__threadfence_block();

between the two lines. Shouldn't all the threads synchronize at that point, and shouldn't all previous memory reads and writes be complete at that point? All threads and memory accesses from the first j-loop should finish before the second j-loop starts. So in our minds, Synchronization Point 1 should isolate the two j-loops and prevent a race condition, but the tool says that's not true.
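
In other words, we believe we're relying on the standard shared-memory phase pattern sketched below (a simplified illustration with made-up names, not our actual kernel): every thread finishes its phase-1 reads, every thread reaches the barrier, and only then does any thread start its phase-2 writes.

    // Sketch only: launch as a single 1-D block of at most 256 threads.
    __global__ void PhasePatternSketch(const float *in, float *out)
    {
        __shared__ float buf[256];               // assumes blockDim.x <= 256

        // Fill shared memory, one element per thread.
        buf[threadIdx.x] = in[threadIdx.x];
        __syncthreads();                         // writes above are visible to every thread below

        // Phase 1: every thread reads a neighbouring shared element.
        float value = buf[(threadIdx.x + 1) % blockDim.x];

        // Synchronization point: every thread in the block reaches this barrier,
        // so all phase-1 reads are finished before any phase-2 write starts.
        __syncthreads();

        // Phase 2: every thread overwrites a shared element.
        buf[threadIdx.x] = value * 2.0f;
        __syncthreads();

        out[threadIdx.x] = buf[threadIdx.x];
    }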

Why is the tool reporting a race condition? Any insights as to what we could do to prevent it?

We've also seen references to a tool that might be able to report a trace of the execution to more easily see the race condition. What tool and options can we use to get a trace to see more clearly why the race condition exists?

   for (i = 0; i < COUNT0; i++) {
      // Synchronization Point 0
      __syncthreads();
      __threadfence_block();
      for (j = 0; j < COUNT1; j++) {
         index = j*blockDim.x + threadIdx.x;
         if (index < THREAD_COUNT0) {
            for (k = 0; k < COUNT2; k++) {
               sharedMemoryIndex = function0(index);
               value = sharedMemory0[sharedMemoryIndex];
            }
         }
      }

      // Synchronization Point 1
      __syncthreads();
      __threadfence_block();
      for (j = 0; j < COUNT2; j++) {
         index = j*blockDim.x + threadIdx.x;
         if (index < THREAD_COUNT1) {
            sharedIndex0 = function1(index);
            sharedIndex1 = function2(index);
            sharedMemory0[sharedIndex0] = sharedMemory1[sharedIndex1];
         }
      }
   }

We've also run the synccheck tool, cuda-memcheck --tool synccheck <executable>, and it reported the following error at Synchronization Point 1. There's probably a strong correlation between the two errors, but the cuda-memcheck guide doesn't say much about what synchronization of divergent code is, why it's bad, or how to fix it.

Any comments?

========= Barrier error detected. Encountered barrier with divergent threads in block
=========     at 0x00004ad8 in CUDA.cu:139:KernelFunction()
=========     by thread (0,0,0) in block (8,0,0)

Upvotes: 0

Views: 678

Answers (2)

roger1994

Reputation: 149

This code gives the same results regardless of the number of threads used to perform the calculations. We ran the code with only a single thread and then with multiple threads. By definition, a single-threaded version cannot run into a race condition, and yet it gave results identical to the multi-threaded version. cuda-memcheck --tool racecheck reports many, many race violations on the multi-threaded version. If race violations were actually occurring, the multi-threaded results would not match the single-threaded results exactly. Therefore, cuda-memcheck must be wrong and must have bugs dealing with complex looping structures. cuda-memcheck was able to find race conditions in simple looping structures, just not in this complicated one.

Upvotes: 0

CygnusX1

Reputation: 21818

There is not enough data to pinpoint your problem accurately. However, the last error message is crucial:

Barrier error detected. Encountered barrier with divergent threads in block

It seems one of your threads in a block reaches a barrier while another does not, because it is in a branch that is not taken. Note that divergent branches occur not only in if conditions, but also in loops whose looping condition differs between threads in a block.
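
A contrived sketch of what such a divergent barrier looks like (made-up names, not your code):

    __global__ void DivergentBarrierSketch(float *data)
    {
        __shared__ float buf[256];               // assumes blockDim.x <= 256

        // Only the first half of the block takes this branch, so only those
        // threads reach the __syncthreads() below: a "barrier with divergent
        // threads in block", which is exactly what synccheck reports.
        if (threadIdx.x < blockDim.x / 2) {
            buf[threadIdx.x] = data[threadIdx.x];
            __syncthreads();                     // illegal: not reached by every thread in the block
            data[threadIdx.x] = buf[threadIdx.x ^ 1];
        }
    }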

When some threads miss a __syncthreads() because of this, strange things can happen. In practice, it usually means that those threads stop at a different __syncthreads(), and the system thinks everything is in sync when it is not. That can later lead to the racing situations you describe.

So - find your divergent __syncthreads() -- that's most likely the cause of your problems. It could be that the problem is before the snippet you included.
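
The usual fix is to restructure the code so that the barrier itself is executed unconditionally by every thread of the block, and only the work around it stays conditional. Roughly (again a sketch with made-up names, to be adapted to your kernel):

    __global__ void HoistedBarrierSketch(float *data)
    {
        __shared__ float buf[256];               // assumes blockDim.x <= 256

        const bool active = threadIdx.x < blockDim.x / 2;

        if (active)
            buf[threadIdx.x] = data[threadIdx.x];

        __syncthreads();                         // executed unconditionally by the whole block

        if (active)
            data[threadIdx.x] = buf[threadIdx.x ^ 1];
    }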

Also:

  • is i a local variable (not shared)?
  • is COUNT0 the same for all threads in a block? (see the sketch below)
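
If COUNT0 (or anything else feeding the i-loop condition) could differ between threads, each thread would execute the barriers at Synchronization Points 0 and 1 a different number of times, which is another way to end up with divergent barriers. A sketch of that situation (hypothetical per-thread count, not your code):

    __global__ void PerThreadLoopBoundSketch(const int *countPerThread, float *out)
    {
        __shared__ float buf[256];               // assumes blockDim.x <= 256

        buf[threadIdx.x] = 0.0f;

        // If the trip count differs between threads of the same block, the threads
        // execute a different number of __syncthreads() calls and the barriers no
        // longer line up across the block -- another form of divergent barrier.
        for (int i = 0; i < countPerThread[threadIdx.x]; i++) {
            buf[threadIdx.x] += 1.0f;
            __syncthreads();                     // divergent if the counts differ
        }

        out[threadIdx.x] = buf[threadIdx.x];
    }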

Upvotes: 2
