Programmer

Reputation: 6753

Are CUDA kernel calls synchronous or asynchronous?

I read that one can use kernel launches to synchronize different blocks, i.e., if I want all blocks to complete operation 1 before they go on to operation 2, I should place operation 1 in one kernel and operation 2 in another kernel. This way, I can achieve global synchronization between blocks. However, the CUDA C programming guide mentions that kernel calls are asynchronous, i.e., the CPU does not wait for the first kernel call to finish, and thus can call the second kernel before the first has finished. If this is true, then we cannot use kernel launches to synchronize blocks. Please let me know where I am going wrong.
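For concreteness, here is a minimal sketch of the two-kernel pattern the question describes (the kernel names `op1`/`op2` and the data sizes are illustrative, not from any particular codebase):

```cuda
#include <cstdio>

// Hypothetical stand-ins for "operation 1" and "operation 2".
__global__ void op1(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * 2.0f;            // operation 1
}

__global__ void op2(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] + 1.0f;            // operation 2: needs all of op1 done
}

int main() {
    float *d;
    cudaMalloc(&d, 256 * sizeof(float));
    cudaMemset(d, 0, 256 * sizeof(float));
    // Both launches go to the default stream: op2 will not start on the GPU
    // until every block of op1 has finished, even though both calls return
    // to the CPU immediately.
    op1<<<2, 128>>>(d);
    op2<<<2, 128>>>(d);
    cudaDeviceSynchronize();             // wait before using results on the host
    cudaFree(d);
    return 0;
}
```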

Upvotes: 33

Views: 25741

Answers (3)

pgplus1628

Reputation: 1374

The accepted answer is not always correct.

In most cases, a kernel launch is asynchronous. But in the following cases it is effectively synchronous, and these are easy to overlook:

  • the environment variable CUDA_LAUNCH_BLOCKING is set to 1;
  • a profiler (nvprof) is collecting hardware counters without concurrent kernel profiling enabled;
  • a memcpy involves host memory that is not page-locked.

Programmers can globally disable asynchronicity of kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is provided for debugging purposes only and should not be used as a way to make production software run reliably.

Kernel launches are synchronous if hardware counters are collected via a profiler (Nsight, Visual Profiler) unless concurrent kernel profiling is enabled. Async memory copies will also be synchronous if they involve host memory that is not page-locked.
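The last case can be demonstrated with a short sketch: an "async" copy from ordinary `malloc`'d (pageable) memory still blocks the CPU, while a copy from page-locked memory allocated with `cudaMallocHost` returns immediately. The buffer size and stream usage here are illustrative only:

```cuda
#include <cstdlib>

int main() {
    const size_t bytes = (1 << 20) * sizeof(float);
    float *pageable = (float *)malloc(bytes);     // ordinary pageable memory
    float *pinned;
    cudaMallocHost(&pinned, bytes);               // page-locked allocation
    float *d;
    cudaMalloc(&d, bytes);

    cudaStream_t s;
    cudaStreamCreate(&s);

    // With pageable host memory, this "async" copy still behaves
    // synchronously with respect to the host: the driver must stage
    // the data through an internal pinned buffer first.
    cudaMemcpyAsync(d, pageable, bytes, cudaMemcpyHostToDevice, s);

    // With page-locked memory, the call returns immediately and the
    // copy can overlap with host work or with kernels in other streams.
    cudaMemcpyAsync(d, pinned, bytes, cudaMemcpyHostToDevice, s);

    cudaStreamSynchronize(s);                     // wait for both copies
    cudaStreamDestroy(s);
    cudaFree(d);
    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}
```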

From the NVIDIA CUDA Programming Guide (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#concurrent-execution-host-device).

Upvotes: 17

Yappie

Reputation: 399

Concurrent kernel execution has been supported since compute capability 2.0.

In addition, control can return to the CPU code before all warps of the kernel have finished executing.

In that case, you have to provide the synchronization yourself.
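A minimal sketch of such explicit synchronization (the kernel `work` and its launch configuration are made up for illustration):

```cuda
__global__ void work(int *out) {
    out[threadIdx.x] = threadIdx.x;
}

int main() {
    int *d;
    cudaMalloc(&d, 32 * sizeof(int));

    work<<<1, 32>>>(d);    // returns to the CPU immediately

    // Block the host until all previously launched GPU work has finished;
    // this also surfaces any launch or execution error from the kernel.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        /* handle the error, e.g. print cudaGetErrorString(err) */
    }

    cudaFree(d);
    return 0;
}
```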

Upvotes: -3

jmsu

Reputation: 2053

Kernel calls are asynchronous from the point of view of the CPU, so if you call 2 kernels in succession the second one will be called without waiting for the first one to finish. It only means that control returns to the CPU immediately.

On the GPU side, if you haven't specified different streams to execute the kernels, they will be executed in the order they were called (if you don't specify a stream, they both go to the default stream and are executed serially). Only after the first kernel finishes will the second one execute.

This behavior holds even for devices with compute capability 2.x, which support concurrent kernel execution (concurrency requires distinct streams). On other devices, even though kernel calls are still asynchronous, kernel execution is always sequential.
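The contrast between the default stream and explicit streams can be sketched as follows (the kernels `k1`/`k2` and launch configurations are illustrative):

```cuda
__global__ void k1() { /* ... */ }
__global__ void k2() { /* ... */ }

int main() {
    // Default stream: both launches return to the CPU immediately, but on
    // the GPU k2 starts only after k1 has completely finished.
    k1<<<4, 256>>>();
    k2<<<4, 256>>>();

    // Separate streams: on devices of compute capability >= 2.0 the two
    // kernels may overlap on the GPU, so no ordering between them is
    // guaranteed.
    cudaStream_t a, b;
    cudaStreamCreate(&a);
    cudaStreamCreate(&b);
    k1<<<4, 256, 0, a>>>();
    k2<<<4, 256, 0, b>>>();

    cudaDeviceSynchronize();   // wait for everything before exiting
    cudaStreamDestroy(a);
    cudaStreamDestroy(b);
    return 0;
}
```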

Check section 3.2.5 of the CUDA C programming guide, which every CUDA programmer should read.

Upvotes: 47
