Stefano Borini
Stefano Borini

Reputation: 143895

Device sync during async memcpy in CUDA

Suppose I want to perform an async memcpy host to device in CUDA, then immediately run the kernel. How can I test in the kernel if the async transfer has completed ?

Upvotes: 4

Views: 4709

Answers (1)

Jared Hoberock
Jared Hoberock

Reputation: 11416

Sequencing your asynchronous copy and kernel launch using a CUDA "stream" ensures that the kernel executes after the asynchronous transfer has completed. The following code example demonstrates:

#include <stdio.h>

__global__ void kernel(const int *ptr)
{
  printf("Hello, %d\n", *ptr);
}

int main()
{
  int *h_ptr = 0;

  // allocate pinned host memory with cudaMallocHost
  // pinned memory is required for asynchronous copy
  cudaMallocHost(&h_ptr, sizeof(int));

  // look for thirteen in the output
  *h_ptr = 13;

  // allocate device memory
  int *d_ptr = 0;
  cudaMalloc(&d_ptr, sizeof(int));

  // create a stream
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // sequence the asynchronous copy on our stream
  cudaMemcpyAsync(d_ptr, h_ptr, sizeof(int), cudaMemcpyHostToDevice, stream);

  // sequence the kernel on our stream after the copy
  // the kernel will execute after the copy has completed
  kernel<<<1,1,0,stream>>>(d_ptr);

  // clean up after ourselves
  cudaStreamDestroy(stream);
  cudaFree(d_ptr);
  cudaFreeHost(h_ptr);
}

And the output:

$ nvcc -arch=sm_20 async.cu -run
Hello, 13

I don't believe there's any supported way to test from within a kernel whether some asynchronous condition (such as the completion of an asynchronous transfer) has been met. CUDA thread blocks are assumed to execute completely independently from other threads of execution.

Upvotes: 9

Related Questions