Reputation: 131445
When you launch a secondary kernel from within a primary one on a GPU, there's some overhead. What factors contribute to, or affect, the amount of this overhead? For example: the size of the kernel code, the occupancy of the SM from which the kernel is launched, the size of the kernel's arguments, etc.
For the sake of this question, let's be inclusive, and define "overhead" as the sum of the following time intervals:
Start: An SM sees the launch instruction
End: An SM starts executing an instruction of the sub-kernel
plus
Start: Last SM executes any instruction of the sub-kernel (or perhaps last write by a sub-kernel instruction is committed to the relevant memory space)
End: Execution of the next instruction of the parent after the sub-kernel launch.
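The two intervals above could be roughly bracketed from device code. Below is an untested sketch using CUDA dynamic parallelism and `clock64()`; the kernel names and launch configurations are illustrative, and note that `clock64()` values from different SMs are not directly comparable, so the launch-to-child-start figure is only a rough indication. Also note that calling `cudaDeviceSynchronize()` from device code was removed in CUDA 12, so this pattern only compiles on older toolkits (with `nvcc -rdc=true`).

```cuda
#include <cstdio>

__global__ void child_kernel(long long *child_start)
{
    // Record a timestamp as close as possible to the child's first instruction.
    if (threadIdx.x == 0 && blockIdx.x == 0)
        *child_start = clock64();
}

__global__ void parent_kernel(long long *child_start)
{
    long long t0 = clock64();            // just before the launch instruction
    child_kernel<<<1, 32>>>(child_start);
    cudaDeviceSynchronize();             // wait for the child (pre-CUDA-12 only)
    long long t1 = clock64();            // next parent instruction after the child

    // Sum of both intervals, as defined above, in parent-SM clock ticks.
    printf("launch + completion overhead: %lld clocks\n", t1 - t0);
    // First interval only; mixes clocks from two SMs, so treat as approximate.
    printf("launch to child start (rough): %lld clocks\n", *child_start - t0);
}

int main()
{
    long long *child_start;
    cudaMalloc(&child_start, sizeof(long long));
    parent_kernel<<<1, 1>>>(child_start);
    cudaDeviceSynchronize();
    cudaFree(child_start);
    return 0;
}
```

Subtracting the child's own execution time from the first figure would isolate the overhead, but for a trivially short child kernel the overhead dominates anyway.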
Upvotes: 3
Views: 382
Reputation: 2916
This answer is not based on experiments or on knowledge of the device-side runtime implementation; it is rather a thought on what needs to be done to perform the operation.
I assume the grid configuration and register usage of the launching kernel have some effect, since its state needs to be saved somewhere for the SM to move on to another kernel. The number of blocks launched may also have some impact, as I don't see how the device runtime could handle all configurations the same way. On the other hand, I don't see why the callee's register usage or code size would have a large impact.
Again, I have no test or experiment to back any of the above.
Upvotes: 1