Reputation: 11496
I have the following pipeline:
Now,the problem is in synchronizing data between the stages 4 and 5. Here are the sync solutions I have tried.
glFlush - doesn't really work as it doesn't guarantee the completeness of the execution of all the commands.
glFinish - this one works. But it is not recommended as it finalizes all the commands submitted to the driver.
ARB_sync Here it is said it is not recommended because it heavily impacts performance.
glMemoryBarrier This one is interesting. But it simply doesn't work.
Here is example of the code:
glMemoryBarrier(GL_ALL_BARRIER_BITS);
And also tried:
glTextureBarrierNV()
The code execution goes like this:
//rendered into the fbo...
glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);
glBindImageTexture(imageUnit1, fboTex, 0, GL_FALSE, 0, GL_READ_ONLY,GL_RGBA8);
glBindImageTexture(imageUnit2, imageTex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA8));
glDispatchCompute(16, 16, 1);
glFinish(); // <-- must sync here,otherwise cuda buffer doesn't receive all the data
//cuda maps the image to CUDA buffer here..
Moreover, I tried unbinding FBOs and unbinding textures from the context before launching compute, I even tried to launch one compute after other with a glMemoryBarrier
set between them, and fetching the target image from the first compute launch to CUDA. Still no synch. (Well,that makes sense as two computes also run out of sync with each other)
after the compute shader stage. It doesn't sync! Only when I replace with glFinish
,or with any other operation which completely stall the pipeline.
Like glMapBuffer()
, for example.
So should I just use glFinish(), or I am missing something here?
Why glMemoryBarrier() doesn't sync compute shader work before CUDA takes over the control?
UPDATE
I would like to refactor the question a little bit as the original one is pretty old. Nevertheless, even with the latest CUDA and Video Codec SDK (NVENC) the issue is still alive.So, I don't care about why glMemoryBarrier
doesn't sync. What I want to know is:
If it is possible to synchronize OpenGL compute shader execution finish with CUDA's usage of that shared resource without stalling the whole rendering pipeline, which is in my case OpenGL image.
If the answer is 'yes', then how?
Upvotes: 6
Views: 1294
Reputation: 585
I know this is an old question, but if any poor soul stumbles upon this...
First, the reason glMemoryBarrier
does not work: it requires the OpenGL driver to insert a barrier into the pipeline. CUDA does not care about the OpenGL pipeline at all.
Second, the only other way outside of glFinish
is to use glFenceSync
in combination with glClientWaitSync
:
....
glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);
glBindImageTexture(imageUnit1, fboTex, 0, GL_FALSE, 0, GL_READ_ONLY,GL_RGBA8);
glBindImageTexture(imageUnit2, imageTex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA8));
glDispatchCompute(16, 16, 1);
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
... other work you might want to do that does not impact the buffer...
GLenum res = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, timeoutInNs);
if(res == GL_TIMEOUT_EXPIRED || res == GL_WAIT_FAILED) {
...handle timeouts and failures
}
cudaGraphicsMapResources(1, &gfxResource, stream);
...
This will cause the CPU to block until the GPU is done with all commands until the fence. This includes memory transfers and compute operations.
Unfortunately, there is no way to tell CUDA to wait on an OpenGL memory barrier/fence. If you really require the extra bit of asynchronicity, you'll have to switch to DirectX 12, for which CUDA supports importing fences/semaphores and waiting on as well as signaling them from a CUDA stream via cuImportExternalSemaphore
, cuWaitExternalSemaphoresAsync
, and cuSignalExternalSemaphoresAsync
.
Upvotes: 3