Reputation: 5188
What is the actual difference between putting multiple kernels in a single program and compiling a separate program for each kernel, aside from source code organization? Specifically, is register pressure dictated by the size of the program or by the kernel that is actually chosen within the program? Is the sum of the __local storage of all kernels allocated when any one of the kernels runs? Are there any other performance-related considerations (e.g. code upload size to the device)?
Upvotes: 4
Views: 748
Reputation: 2796
This could be device-specific, and I speak from Intel GPU experience. Program-scope resources are only visible to kernels in that program. Beyond that, register allocation is per-kernel; hence, 1 kernel in each of K programs vs. K kernels in 1 program has no effect on register pressure. You do build and link per-program, though, so compiling K kernels in one program is less efficient in terms of startup time if you don't use all K of the kernels.
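If you want to check this on your own device, here is a rough sketch (not something I have measured across vendors) that builds two kernels in one program and asks the runtime for each kernel's resource usage via `clGetKernelWorkGroupInfo`. The kernel names `scale` and `rotate` and their bodies are made up for illustration. Note that `CL_KERNEL_PRIVATE_MEM_SIZE` reports per-work-item private memory in bytes, which is only a proxy for register usage, and how a driver fills it in is implementation-defined.

```c
/* Sketch: per-kernel resource queries for two kernels built in ONE program.
 * Requires an OpenCL runtime and headers; link with -lOpenCL. */
#define CL_TARGET_OPENCL_VERSION 120
#include <stdio.h>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

/* Hypothetical kernels: "scale" uses no __local memory, "rotate" declares
 * 4 KiB of it. They are only built here, never enqueued; "rotate" assumes
 * a work-group size of at most 1024. */
static const char *src =
    "__kernel void scale(__global float *a) {\n"
    "    a[get_global_id(0)] *= 2.0f;\n"
    "}\n"
    "__kernel void rotate(__global float *a) {\n"
    "    __local float tile[1024];\n"
    "    size_t l = get_local_id(0);\n"
    "    tile[l] = a[get_global_id(0)];\n"
    "    barrier(CLK_LOCAL_MEM_FENCE);\n"
    "    a[get_global_id(0)] = tile[(l + 1) % get_local_size(0)];\n"
    "}\n";

/* Print the __local and private memory the runtime reports for one kernel. */
static void report(cl_program prog, cl_device_id dev, const char *name)
{
    cl_int err;
    cl_kernel k = clCreateKernel(prog, name, &err);
    if (err != CL_SUCCESS) {
        fprintf(stderr, "clCreateKernel(%s) failed: %d\n", name, (int)err);
        return;
    }
    cl_ulong local_bytes = 0, private_bytes = 0;
    clGetKernelWorkGroupInfo(k, dev, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof local_bytes, &local_bytes, NULL);
    clGetKernelWorkGroupInfo(k, dev, CL_KERNEL_PRIVATE_MEM_SIZE,
                             sizeof private_bytes, &private_bytes, NULL);
    printf("%-8s local = %llu bytes, private = %llu bytes\n", name,
           (unsigned long long)local_bytes,
           (unsigned long long)private_bytes);
    clReleaseKernel(k);
}

int main(void)
{
    cl_platform_id platform;
    cl_device_id dev;
    cl_int err;

    /* Take the first platform and its default device. */
    if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS ||
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL)
            != CL_SUCCESS) {
        fprintf(stderr, "no OpenCL device found\n");
        return 1;
    }

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    if (err != CL_SUCCESS) {
        fprintf(stderr, "clCreateContext failed: %d\n", (int)err);
        return 1;
    }

    /* One program containing both kernels: a single build step covers both. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    if (err != CL_SUCCESS ||
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL) != CL_SUCCESS) {
        fprintf(stderr, "program build failed\n");
        if (err == CL_SUCCESS)
            clReleaseProgram(prog);
        clReleaseContext(ctx);
        return 1;
    }

    report(prog, dev, "scale");
    report(prog, dev, "rotate");

    clReleaseProgram(prog);
    clReleaseContext(ctx);
    return 0;
}
```

If the two kernels report different numbers, that is consistent with resources being tracked per kernel rather than per program. To look at the startup-time side, you could time `clBuildProgram` for this combined source against a source string holding only the kernel you actually use.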
Upvotes: 3