Multiple kernels in a single program vs. one kernel per program

Question

What is the actual difference between putting multiple kernels in a single program and compiling a separate program for each kernel, aside from source code organization? Specifically, is register pressure dictated by the size of the program or by the kernel actually chosen within it? Is the sum of the __local storage of all kernels allocated when any one of them runs? Are there any other performance-related observations to make (e.g. code upload size to the device)?

Tim · Accepted Answer

This could be device-specific; I speak from Intel GPU experience. Program-scope resources are visible only to kernels in that program. Beyond that, register allocation is per kernel, so one kernel in each of K programs vs. K kernels in one program has no effect on register pressure. Building and linking, however, happen per program, so compiling K kernels in one program is less efficient in terms of startup time if you don't use all K kernels.
