nilspin

Reputation: 61

How does glUseProgram() work at the hardware level?

  1. I think (I'm not sure) that whenever we call glUseProgram(), the vertex shader, fragment shader, geometry shader, etc. are loaded into the program memory of the respective processor types on the GPU (e.g. vertex shader -> vertex shader processor).
    But I also find it surprising how fast this 'loading' happens. Despite having a complex rendering pipeline and multiple shaders, games are often able to deliver very good FPS, which means program loading is probably happening thousands of times each second.
    Am I right in thinking this is how it works? If yes, am I really underestimating how powerful GPUs are?
  2. (Mini question related to (1)) Where can I find more information on the correlation between internal GPU bus bandwidth and performance (or FPS)?
  3. (Another mini question) Are shaders compiled on the client or on the device?
  4. Post compilation, are shader programs stored in the GPU's (shared) memory, or are processor caches big enough to hold many shaders?

Upvotes: 3

Views: 468

Answers (1)

Bartek Banachewicz

Reputation: 39380

are loaded in program memory of the respective type processors in the GPU

I am pretty sure that the compiled shaders are kept "close enough" to the shading units to make loading itself quite efficient. The instruction caches contain the necessary data, and it's easy to rewrite them with data from VRAM. After all, there's no CPU sync needed, and everything happens entirely within the GPU.

An important thing to note here is that modern GPUs don't have "respective type processors" anymore; they only have general-purpose shading units that can run various computations, including, but not limited to, fragment and vertex shading.

Despite having a complex rendering pipeline and multiple shaders, games are often able to deliver very good FPS, which means program loading is probably happening thousands of times each second.

Yes, modern games can have thousands of pipeline setups in order to draw a single frame. GPUs are fast. Modern OpenGL also made it easier to have many programs via the Separate Shader Objects extension, which helps make rendering more modular.
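As a rough illustration of what that modularity looks like, here is a minimal sketch of the Separate Shader Objects path (OpenGL 4.1 / ARB_separate_shader_objects). The shader source strings and the loader (GLEW here) are assumptions; error checking is omitted.

    // Build single-stage programs and combine them in a pipeline object.
    #include <GL/glew.h>   // assumes a function loader is already initialized

    GLuint buildPipeline(const char* vsSource, const char* fsSource)
    {
        // Each stage becomes its own single-stage program object.
        GLuint vsProg = glCreateShaderProgramv(GL_VERTEX_SHADER,   1, &vsSource);
        GLuint fsProg = glCreateShaderProgramv(GL_FRAGMENT_SHADER, 1, &fsSource);

        // A pipeline object mixes and matches stages without relinking.
        GLuint pipeline = 0;
        glGenProgramPipelines(1, &pipeline);
        glUseProgramStages(pipeline, GL_VERTEX_SHADER_BIT,   vsProg);
        glUseProgramStages(pipeline, GL_FRAGMENT_SHADER_BIT, fsProg);
        return pipeline;
    }

    // At draw time (with no program bound via glUseProgram), binding the
    // pipeline takes the place of glUseProgram():
    //     glBindProgramPipeline(pipeline);
    //     glDrawArrays(GL_TRIANGLES, 0, vertexCount);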

(Mini question related to (1)) Where can I find more information on the correlation between internal GPU bus bandwidth and performance (or FPS)?

This is way too broad to answer and depends heavily on the workload. This is an interesting document, though, that might shed some light on your questions and possibly inspire you to do further research.

All in all, glUseProgram itself will typically do pretty much... nothing on a modern driver (in terms of performance). This is because drivers use a form of lazy evaluation and only actually commit the state changes when they know which draw call is going to use them. How efficient a driver is at optimizing out unnecessary calls, reordering, etc. depends entirely on the implementation.
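The driver-internal details aren't observable from the application, but a rough application-side analogue of that deferral would look something like the hypothetical helpers below (this is not driver code, just a sketch of the "record now, commit at draw time" idea):

    #include <GL/glew.h>

    static GLuint g_requestedProgram = 0;  // what the app last asked for
    static GLuint g_committedProgram = 0;  // what has actually been bound

    void requestProgram(GLuint program)
    {
        g_requestedProgram = program;      // cheap: just record the intent
    }

    void commitStateForDraw()
    {
        // The real work happens only if the state actually changed,
        // and only right before a draw call that needs it.
        if (g_requestedProgram != g_committedProgram) {
            glUseProgram(g_requestedProgram);
            g_committedProgram = g_requestedProgram;
        }
    }

    void drawMesh(GLuint vao, GLsizei vertexCount)
    {
        commitStateForDraw();
        glBindVertexArray(vao);
        glDrawArrays(GL_TRIANGLES, 0, vertexCount);
    }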

Are shaders compiled on the client or on the device?

They are compiled, in OpenGL's terms, on the server, but that doesn't necessarily mean the physical device. Typically it's the OS-resident part of the driver that does the shader compilation.
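In other words, the GLSL text you hand to the API is compiled by the driver (on the CPU) into the GPU's native instructions; the application only ever sees success/failure and an info log. A minimal sketch of that handoff:

    #include <GL/glew.h>
    #include <cstdio>

    GLuint compileShader(GLenum stage, const char* source)
    {
        GLuint shader = glCreateShader(stage);
        glShaderSource(shader, 1, &source, nullptr);
        glCompileShader(shader);                 // the driver compiles the GLSL here

        GLint ok = GL_FALSE;
        glGetShaderiv(shader, GL_COMPILE_STATUS, &ok);
        if (!ok) {
            char log[1024];
            glGetShaderInfoLog(shader, sizeof log, nullptr, log);
            std::fprintf(stderr, "compile failed: %s\n", log);
        }
        return shader;
    }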

Post compilation, are shader programs stored in the GPU's (shared) memory, or are processor caches big enough to hold many shaders?

Both. Programs are stored in global memory and, if possible, in each processor's instruction cache, which is a few kilobytes in size. Whether they fit depends on the size of the shaders and the size of the caches, but typically a few programs will fit. The cache is populated in an LRU fashion at runtime.
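If you want a feel for how big a compiled program actually is, one option is to pull its driver-specific binary via glGetProgramBinary (OpenGL 4.1 / ARB_get_program_binary). Note that some drivers expect GL_PROGRAM_BINARY_RETRIEVABLE_HINT to be set before linking; the sketch below assumes a successfully linked program:

    #include <GL/glew.h>
    #include <vector>
    #include <cstdio>

    void printProgramBinarySize(GLuint program)
    {
        GLint length = 0;
        glGetProgramiv(program, GL_PROGRAM_BINARY_LENGTH, &length);
        if (length <= 0)
            return;  // binary not retrievable (e.g. link failed or hint not set)

        std::vector<unsigned char> binary(length);
        GLenum format = 0;
        glGetProgramBinary(program, length, nullptr, &format, binary.data());

        // Sizes are typically on the order of kilobytes, which is why a few
        // programs can sit in a shading unit's instruction cache at once.
        std::printf("program %u: %d bytes (format 0x%X)\n", program, length, format);
    }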

Upvotes: 7
