Failing to read an array from the global mem CUDA

Question

I am implementing a complicated algorithm on CUDA. But there is a really odd problem. The problem can be summarised as following: the kernel will repeat a series of calculation many times. The calculation of the present iteration is upon the result of the previous one. I am using an array on the global memory for passing information between blocks in each iteration. For example there are 2 blocks, for each iteration block 0 saves the result to the global memory, then block 1 read it from the global memory. However the problem is that the block 1 can’t read the array from the global memory. it sometimes returns the result of the 1st iteration, not the previous one.

a_e and e_a are two arrays on the global mem, the size is [2*8]. d_a_e and d_e_a are on the shared mem, the size is [blockDim.x+1][8].

if(threadIdx.x<8)
{
     //block 0 writes, block 1 reads, this can't work properly
     a_e[blockIdx.x*8+threadIdx.x]=d_a_e[blockDim.x][threadIdx.x];
     if(blockIdx.x>0)
          d_a_e[0][threadIdx.x]=a_e[(blockIdx.x-1)*8+threadIdx.x];

     //block 1 writes, block 0 reads, this can work properly
     e_a[blockIdx.x*8+threadIdx.x]=d_e_a[0][threadIdx.x];
     if(blockIdx.x < gridDim.x-1)
          d_e_a[blockDim.x][threadIdx.x]=e_a[(blockIdx.x+1)*8+threadIdx.x];
}

alrikai · Accepted Answer

This setup won't work; you're effectively trying to serialize your blocks, which as talonmies alluded to in his comment, doesn't work. From the CUDA programming guide:

"Thread blocks are required to execute independently: It must be possible to execute them in any order, in parallel or in series. This independence requirement allows thread blocks to be scheduled in any order across any number of cores..." http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy

Your best recourse if probably to launch seperate kernels (such that you perform the block 0 computation in the 1st kernel, block 1 in the 2nd kernel, etc) to try to enforce that the results from the 1st kernel are done before reading them in the next kernel. There has been some work done on have inter-block synchronization, but you wouldn't derive much benefit from them, as you need to serialize your blocks.

EDIT: I should also point out that the block scheduling isn't documented, and is liable to change at any point, so any inter-block synchronization will be non-portable and liable to break on a driver or CUDA toolkit update.

Failing to read an array from the global mem CUDA

Answers (1)

Related Questions