Arun AC

Reputation: 417

Why is the compute shader slowing down the rendering API calls?

I am using a compute shader to process the input buffer data and store the result in an output texture using imageStore().

After executing the compute shader, I make 3 render calls sequentially.

Compute Shader Code:

#version 310 es
precision mediump image2D;
layout(std430) buffer; // Sets the default layout for SSBOs
layout(local_size_x = 256) in; // 256 threads per work group
layout(binding = 0) readonly buffer InputBuf
{
    uint input_buff[];
} inputbuff;
layout (rgba32f, binding = 1) uniform writeonly image2D out_teximg;
void main()
{
    int idx = int(gl_GlobalInvocationID.x);
    int idy = int(gl_GlobalInvocationID.y);
    uint inputpix = inputbuff.input_buff[1024 * idy + idx];
    // some calculation on inputpix and output is rcolor, bcolor, gcolor
    imageStore(out_teximg, ivec2(idx, idy), vec4(rcolor, bcolor, gcolor, 1.0));
    barrier();
}

Code:

void initCompute()
{
    glGenTextures(1, &computeOutTex);
    glGenBuffers(1, &inSSBOId);
}

uint inputBuffData[] = { .... }; // input buffer data
void execute_compute()
{
    // compute shader code starts...
    glUseProgram(computePgmId);
    glActiveTexture(GL_TEXTURE0);
    glBindTexture(GL_TEXTURE_2D, computeOutTex);
    glTexStorage2D(GL_TEXTURE_2D, 1, GL_RGBA32F, width, height);

    glBindImageTexture(1, computeOutTex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA32F); // binding is 1 
    glUniform1i( glGetUniformLocation(computePgmId, "out_teximg"), 0);
    
    uint inputBuffSize = 1024 * 512 * 3;
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, inSSBOId);
    glBufferData(GL_SHADER_STORAGE_BUFFER, inputBuffSize, inputBuffData, GL_STATIC_DRAW);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER,  0 , inSSBOId); // binding is 0

    glDispatchCompute(width / 256, height, 1);
    glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);
    // glFinish();
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);
    glBindImageTexture(1, 0, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA32F);  // binding is 1 
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, 0);// binding is 0
}

int draw()
{
    glBindFramebuffer(GL_FRAMEBUFFER, m_FBOId); // Offscreen Rendering
    glClear(GL_COLOR_BUFFER_BIT);
    glUseProgram(compute_pgm); 
    execute_compute();

    glUseProgram(render_pgm1);
    glViewport(0,0,w,h);
    glActiveTexture(GL_TEXTURE0);
    glBindTexture(GL_TEXTURE_2D, computeOutTex);
    glDrawElements(); // Render the texture data

  // 2nd draw call 
    glUseProgram(render_pgm2);
    ....
    ....
    glDrawElements();

  // 3rd draw call
    glUseProgram(render_pgm3);
    ....
    ....
    glDrawElements(); 
    
    glBindFramebuffer(GL_FRAMEBUFFER, 0); // unbind FBO
}

Here, only the 2nd draw call takes more time after the compute shader is used.

If glFinish() is called after glMemoryBarrier(), then only the execute_compute() call is slowed down instead. Why is the compute shader slowing down the subsequent draw calls? Is glFinish() really needed?

Upvotes: 1

Views: 997

Answers (1)

Rabbid76

Reputation: 210890

The compute shader does not slow down the subsequent draw calls, but the compute shader itself takes some time to execute. Since you set a memory barrier, the subsequent draw calls have to wait until that work is done.
OpenGL commands are queued and are not executed immediately when they are called. The GPU and the CPU work in parallel: the CPU submits instructions to the GPU, and the GPU processes them as quickly as it can.
glFinish does not return until all previously issued commands have completed. glFinish itself is not "costly"; it only appears costly when you measure time on the CPU, because what you are really measuring is the time it takes the GPU to complete all previously issued OpenGL commands.
Anyway, glFinish is not needed here. All you need is the memory barrier. With the memory barrier in place, the OpenGL commands that depend on it appear to take longer to complete. They do not actually take longer; they just have to wait until the condition signaled by the barrier is met.
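A rough CPU-side timing sketch of this effect (now_ms() and draw_with_texture() are assumed placeholder helpers, not from the question):

```c
/* Without glFinish, execute_compute() returns almost immediately,
   because the commands are only queued; the wait then shows up in
   whichever later command first depends on the compute result. */
double t0 = now_ms();
execute_compute();     /* returns quickly: commands are queued, not done */
double t1 = now_ms();
/* glFinish(); */      /* uncommenting moves the wait to this point */
draw_with_texture();   /* this draw waits (on the GPU) for the barrier */
double t2 = now_ms();
printf("submit: %.2f ms, dependent draw: %.2f ms\n", t1 - t0, t2 - t1);
```

Either way the total GPU time is the same; glFinish only changes where the stall is observed on the CPU timeline.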

In your case you need to use GL_ALL_BARRIER_BITS or, more specifically, GL_TEXTURE_FETCH_BARRIER_BIT, which makes incoherent memory writes (e.g. image stores) prior to the barrier visible to texture fetches after the barrier.
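A minimal sketch of the corrected ordering, using the names from the question (computeOutTex, width, height); the draw-call arguments are placeholders:

```c
/* Dispatch the compute shader that writes the image. */
glDispatchCompute(width / 256, height, 1);

/* GL_SHADER_IMAGE_ACCESS_BARRIER_BIT only orders later image
   loads/stores. Sampling the texture in the render programs needs
   GL_TEXTURE_FETCH_BARRIER_BIT (or GL_ALL_BARRIER_BITS). */
glMemoryBarrier(GL_TEXTURE_FETCH_BARRIER_BIT);

/* No glFinish() required: the barrier alone orders the GPU work.
   The dependent draw simply waits on the GPU side. */
glBindTexture(GL_TEXTURE_2D, computeOutTex);
glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_SHORT, 0);
```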

Upvotes: 2
