Reputation: 131445
When you launch a secondary kernel from within a primary one on a GPU, there's some overhead. What factors contribute to, or affect, the amount of this overhead? For example: the size of the kernel code, the occupancy of the SM from which the kernel is launched, the size of the kernel's arguments, etc.
For the sake of this question, let's be inclusive, and define "overhead" as the sum of the following time intervals:
Start: An SM sees the launch instruction
End: An SM starts executing an instruction of the sub-kernel
plus
Start: Last SM executes any instruction of the sub-kernel (or perhaps last write by a sub-kernel instruction is committed to the relevant memory space)
End: Execution of the next instruction of the parent after the sub-kernel launch.
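The two intervals above could be roughly bracketed from device code. Below is an untested sketch using CUDA dynamic parallelism and `clock64()`; the kernel names and launch configurations are illustrative, and note that `clock64()` values from different SMs are not directly comparable, so the launch-to-child-start figure is only a rough indication. Also note that calling `cudaDeviceSynchronize()` from device code was removed in CUDA 12, so this pattern only compiles on older toolkits (with `nvcc -rdc=true`).

```cuda
#include <cstdio>

__global__ void child_kernel(long long *child_start)
{
    // Record a timestamp as close as possible to the child's first instruction.
    if (threadIdx.x == 0 && blockIdx.x == 0)
        *child_start = clock64();
}

__global__ void parent_kernel(long long *child_start)
{
    long long t0 = clock64();            // just before the launch instruction
    child_kernel<<<1, 32>>>(child_start);
    cudaDeviceSynchronize();             // wait for the child (pre-CUDA-12 only)
    long long t1 = clock64();            // next parent instruction after the child

    // Sum of both intervals, as defined above, in parent-SM clock ticks.
    printf("launch + completion overhead: %lld clocks\n", t1 - t0);
    // First interval only; mixes clocks from two SMs, so treat as approximate.
    printf("launch to child start (rough): %lld clocks\n", *child_start - t0);
}

int main()
{
    long long *child_start;
    cudaMalloc(&child_start, sizeof(long long));
    parent_kernel<<<1, 1>>>(child_start);
    cudaDeviceSynchronize();
    cudaFree(child_start);
    return 0;
}
```

Subtracting the child's own execution time from the first figure would isolate the overhead, but for a trivially short child kernel the overhead dominates anyway.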
Upvotes: 3
Views: 382
Reputation: 2916
This answer is not based on experiments or on knowledge of the device-side runtime implementation; it is rather a thought on what needs to be done to perform the operation.
I assume the grid configuration and register usage of the launching kernel have some effect, since its state needs to be saved somewhere for the SM to move on to another kernel. The number of blocks launched may also have some impact, as I don't see how the device runtime could handle all configurations the same way. On the other hand, I don't see why the callee's register usage or code size would have a large impact.
Again, I have no test or experiment to back any of the above.
Upvotes: 1