Ken Y-N

Reputation: 15009

Can separate CUDA kernels be merged using __syncthreads()?

Say I have this toy code:

#define N (1024*1024)
#define M (1000000)

__global__ void cudakernel1(float *buf)
{
   int i = threadIdx.x + blockIdx.x * blockDim.x;
   buf[i] = 1.0f * i / N;
   for(int j = 0; j < M; j++)
      buf[i] *= buf[i];
}

__global__ void cudakernel2(float *buf)
{
   int i = threadIdx.x + blockIdx.x * blockDim.x;
   for(int j = 0; j < M; j++)
      buf[i] += buf[i];
}

int main()
{
   float data[N];
   float *d_data;
   cudaMalloc(&d_data, N * sizeof(float));
   cudakernel1<<<N/256, 256>>>(d_data);
   cudakernel2<<<N/256, 256>>>(d_data);
   cudaMemcpy(data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
   cudaFree(d_data); 
}

Can I merge the two kernels like so:

#define N (1024*1024)
#define M (1000000)

__global__ void cudakernel1_plus_2(float *buf)
{
   int i = threadIdx.x + blockIdx.x * blockDim.x;
   buf[i] = 1.0f * i / N;
   for(int j = 0; j < M; j++)
      buf[i] *= buf[i];

   __syncthreads();

   for(int j = 0; j < M; j++)
      buf[i] += buf[i];
}

int main()
{
   float data[N];
   float *d_data;
   cudaMalloc(&d_data, N * sizeof(float));
   cudakernel1_plus_2<<<N/256, 256>>>(d_data);
   cudaMemcpy(data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
   cudaFree(d_data); 
}

Is it true in general that two consecutive kernels launched with the same block and thread parameters can be merged into one kernel with an intermediate __syncthreads()?

(My real case is 6 consecutive non-trivial kernels that have a lot of set-up and tear-down overhead).

Upvotes: 1

Views: 251

Answers (1)

Robert Crovella

Reputation: 151799

The simplest, most general answer is no. To support that, I only need to find one example for which the paradigm breaks. Let's remind ourselves that:

  1. __syncthreads() is a block level execution barrier, but not a device-wide execution barrier. The only defined device-wide execution barrier is the kernel launch (assuming we're talking about issuing kernels into the same stream, for sequential execution).

  2. threadblocks of a particular kernel launch can execute in any order.

Let's say we have 2 functions:

  1. Reverse the elements of a vector
  2. Sum the elements of the vector

Let's assume the vector reversal is not an in-place operation (the output is distinct from the input), and that each threadblock handles a block-sized chunk of the vector, reading the elements and storing to the appropriate location in the output vector.

To keep it really simple, we'll imagine we only have (need) two threadblocks. For the first step, block 0 copies the left hand side of the vector to the right hand side (reversing the order) and block 1 copies right-to-left:

1 2 3 4 5 6 7 8
|blk 0 |blk 1  |
     \ | /
       X
      /| \
     v |  v
8 7 6 5 4 3 2 1

For the second step, in classical parallel reduction fashion, block zero sums the left hand elements of the output vector, and block 1 sums the right hand elements:

8 7 6 5 4 3 2 1
  \  /   \  /
  blk0    blk1
   26      10

As long as the first function is issued as kernel1 and the second function is issued as kernel2, into the same stream after kernel1, this all just works. Within each kernel, it does not matter whether block 0 executes before block 1, or vice versa.
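
As a concrete sketch (the kernel and variable names here are just illustrative, and the reduction assumes a power-of-two block size with the vector length a multiple of it), the two-launch arrangement could look something like this, with the kernel launch boundary acting as the device-wide sync:

// Sketch only: reverse-then-sum as two separate launches into the same stream.
__global__ void reverse_kernel(const float *in, float *out, int n)
{
   int i = threadIdx.x + blockIdx.x * blockDim.x;
   if (i < n)
      out[n - 1 - i] = in[i];   // a block's writes land in the "other" half
}

__global__ void sum_kernel(const float *v, float *block_sums, int n)
{
   extern __shared__ float s[];  // launch with blockDim.x * sizeof(float) shared bytes
   int i = threadIdx.x + blockIdx.x * blockDim.x;
   s[threadIdx.x] = (i < n) ? v[i] : 0.0f;
   __syncthreads();
   // classical shared-memory tree reduction within the block
   for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
      if (threadIdx.x < stride)
         s[threadIdx.x] += s[threadIdx.x + stride];
      __syncthreads();
   }
   if (threadIdx.x == 0)
      block_sums[blockIdx.x] = s[0];
}

// reverse_kernel<<<grid, block>>>(d_in, d_out, n);
// sum_kernel<<<grid, block, block * sizeof(float)>>>(d_out, d_block_sums, n);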

If we combine the operations into a single kernel, so that block 0 copies/reverses the first half of the vector to the second half of the output vector, then executes a __syncthreads(), then sums the first half of the output vector, things are likely to break. If block 0 executes before block 1, the first step (the copy/reversal) will be fine, but the second step will be operating on an output-array half that has not been populated yet, because block 1 has not begun executing. The computed sum will be wrong.
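
Here is a minimal sketch of that broken combination (same illustrative names as above) to make the hazard concrete; the __syncthreads() only orders the threads of one block, so nothing stops block 0 from reaching the reduction before block 1 has written its half of the output:

// BROKEN sketch -- do not use. __syncthreads() is a block-level barrier only,
// so the reduction may read elements of out[] another block has not written yet.
__global__ void reverse_then_sum_broken(const float *in, float *out,
                                        float *block_sums, int n)
{
   extern __shared__ float s[];
   int i = threadIdx.x + blockIdx.x * blockDim.x;
   if (i < n)
      out[n - 1 - i] = in[i];       // writes land in the *other* block's half

   __syncthreads();                 // waits for this block's threads only!

   s[threadIdx.x] = (i < n) ? out[i] : 0.0f;   // may read data not yet written
   __syncthreads();
   for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
      if (threadIdx.x < stride)
         s[threadIdx.x] += s[threadIdx.x + stride];
      __syncthreads();
   }
   if (threadIdx.x == 0)
      block_sums[blockIdx.x] = s[0];
}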

Without trying to give formal proofs, we can see that in the above case where there is data movement from one block's "domain" to another block's "domain", we run the risk of breaking things, because the previous device-wide sync (kernel launch) was necessary for correctness. However, if we can limit the "domain" of a block so that any data consumed by subsequent operations is produced only by previous operations in that block, then a __syncthreads() may be sufficient to allow this strategy with correctness. (The previous silly example could easily be reworked to allow this, simply by having block 0 be responsible for the first half of the output vector, thus copying from the second half of the input vector, and vice versa for the other block.)
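
A sketch of that rework (again, illustrative names only): each block writes the half of the output it will later sum, reading from the opposite end of the input, so everything a block consumes after the __syncthreads() was produced inside that same block:

// Reworked sketch: block b owns out[b*blockDim.x .. (b+1)*blockDim.x - 1],
// fills it (reversed) from the far end of in[], then sums it. A block-level
// __syncthreads() is now sufficient because no data crosses block boundaries.
__global__ void reverse_then_sum_blockwise(const float *in, float *out,
                                           float *block_sums, int n)
{
   extern __shared__ float s[];
   int i = threadIdx.x + blockIdx.x * blockDim.x;   // index this block owns in out[]
   if (i < n)
      out[i] = in[n - 1 - i];       // read from the far end, write our own half

   __syncthreads();                 // block-level barrier is enough here

   s[threadIdx.x] = (i < n) ? out[i] : 0.0f;
   __syncthreads();
   for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
      if (threadIdx.x < stride)
         s[threadIdx.x] += s[threadIdx.x + stride];
      __syncthreads();
   }
   if (threadIdx.x == 0)
      block_sums[blockIdx.x] = s[0];
}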

Finally, if we limit the data scope to a single thread, then we can make such combinations without even using __syncthreads(). These last two cases might have the characteristics of "embarrassingly parallel" problems, which exhibit a high degree of independence.
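
Your toy kernels are an example of that last case: every thread reads and writes only buf[i], so as a sketch (reusing the N and M macros from the question) they can be fused with no barrier at all:

// Sketch: cudakernel1 and cudakernel2 fused. Each thread touches only buf[i],
// so no __syncthreads() is needed for correctness.
__global__ void cudakernel1_plus_2_nosync(float *buf)
{
   int i = threadIdx.x + blockIdx.x * blockDim.x;
   buf[i] = 1.0f * i / N;
   for (int j = 0; j < M; j++)
      buf[i] *= buf[i];
   for (int j = 0; j < M; j++)
      buf[i] += buf[i];
}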

Upvotes: 3
