Reputation: 972
Consider this pseudocode:
VkCommandBuffer mainCmdBuffer = vkAllocateCommandBuffers ();
VkBuffer stagingBuffer = vkCreateBuffer (VK_BUFFER_USAGE_TRANSFER_SRC_BIT);
vkBindBufferMemory (stagingBuffer, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT);
void* hostMemory = vkMapMemory (stagingBuffer); // shorthand: vkMapMemory actually takes the VkDeviceMemory bound to the buffer
VkBuffer vertexBuffer = vkCreateBuffer (VK_BUFFER_USAGE_VERTEX_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT);
vkBindBufferMemory (vertexBuffer, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT);
for (;;) {
acquire_frame ();
vkBeginCommandBuffer (mainCmdBuffer);
update_vertices (hostMemory);
vkCmdCopyBuffer (mainCmdBuffer, stagingBuffer, vertexBuffer, /* one VkBufferCopy covering the whole buffer; VkBufferCopy::size must be explicit, VK_WHOLE_SIZE is not valid here */);
VkBufferMemoryBarrier bufferBarrierMemory = {};
bufferBarrierMemory.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
bufferBarrierMemory.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
bufferBarrierMemory.dstAccessMask = VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT;
bufferBarrierMemory.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
bufferBarrierMemory.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
bufferBarrierMemory.buffer = vertexBuffer;
bufferBarrierMemory.offset = 0;
bufferBarrierMemory.size = VK_WHOLE_SIZE;
//Is this needed?
vkCmdPipelineBarrier (mainCmdBuffer, VK_PIPELINE_STAGE_TRANSFER_BIT, VK_PIPELINE_STAGE_VERTEX_INPUT_BIT, 0,
0, nullptr, 1, &bufferBarrierMemory, 0, nullptr);
begin_render_pass ();
vkCmdBindPipeline ();
vkCmdBindVertexBuffers (vertexBuffer);
vkCmdDraw (vertexCount); // vkCmdDraw takes a vertex count, not a buffer size
end_render_pass ();
vkEndCommandBuffer (mainCmdBuffer);
vkQueueSubmit ();
vkQueuePresentKHR ();
vkQueueWaitIdle ();
}
I searched a lot and tested a lot but cannot find a conclusive answer. Most other code I found (e.g. all of Sascha Willems' examples) does not use a barrier when updating a vertex/index buffer. They only use one when, for example, reading an indirect command buffer after a compute-shader stage (sketched below). On the other hand, most tutorials I read, including official ones from Intel and Nvidia, say that a barrier is needed. So which is it?
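For reference, that compute-to-indirect pattern looks roughly like this (my own sketch; indirectCmdBuffer is a placeholder for the buffer the compute shader writes):
VkBufferMemoryBarrier indirectBarrier = {};
indirectBarrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
indirectBarrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT; // the compute shader wrote the draw commands
indirectBarrier.dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT; // the indirect draw reads them
indirectBarrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
indirectBarrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
indirectBarrier.buffer = indirectCmdBuffer;
indirectBarrier.offset = 0;
indirectBarrier.size = VK_WHOLE_SIZE;
vkCmdPipelineBarrier (mainCmdBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT, 0,
0, nullptr, 1, &indirectBarrier, 0, nullptr);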
I guess the answer is "yes, just emit a barrier", and I thought so too, but after doing some tests I realized that there is something more that I don't understand. Hear me out:
At first I used a HostVisible vertex buffer, filled it with data (250k vertices) and drew the whole buffer in a loop. This takes around 0.424 ms per frame on my setup.
Then I used a DeviceLocal vertex buffer, filled it once with the same data before the main loop and drew that buffer in a loop. This takes around 0.300 ms per frame on my setup. For me this is expected behavior, since the HostVisible vertex buffer needs to be transferred to the GPU every frame, which takes some time and makes it slower.
At last I used a DeviceLocal vertex buffer but filled it with all the data every frame inside the loop. This takes around 0.550 ms per frame, which confused me at first since it takes longer than the HostVisible version. So I also measured the transfer part alone (timing sketch below), which clocks in at 0.250 ms per frame. That makes the complete time basically t(transfer) + t(drawing from device-local storage), which makes sense since I was using a barrier. The HostVisible version is probably faster because the transfer and the drawing happen at the same time (that is my guess). So far so good.
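(For reference, one way to take per-command GPU timings like these is with timestamp queries. A sketch; queryPool, device and timestampPeriod are assumed to be set up elsewhere, with timestampPeriod taken from VkPhysicalDeviceLimits:)
vkCmdResetQueryPool (mainCmdBuffer, queryPool, 0, 2);
vkCmdWriteTimestamp (mainCmdBuffer, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, queryPool, 0);
// ... the commands being measured, e.g. the vkCmdCopyBuffer ...
vkCmdWriteTimestamp (mainCmdBuffer, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, queryPool, 1);
// after the submit has completed:
uint64_t ts[2];
vkGetQueryPoolResults (device, queryPool, 0, 2, sizeof (ts), ts, sizeof (uint64_t),
VK_QUERY_RESULT_64_BIT | VK_QUERY_RESULT_WAIT_BIT);
double ms = (ts[1] - ts[0]) * timestampPeriod * 1e-6; // timestampPeriod is in nanoseconds per tick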
But then I tested further and removed the barrier, expecting the time to get close to the HostVisible version and expecting visual artifacts or maybe even a GPU crash. But no ... nothing happened differently. Everything renders as before and the duration did not improve either, which I don't understand. I thought all commands pushed to a queue start sequentially but finish in an unspecified order, meaning they are executed in parallel. So the transfer and the draw should run at the same time if I don't use a barrier?! And if that is not the case and all commands are executed sequentially, why use a buffer barrier at all?
I tested this further by uploading all triangles in red (color), staging all of them in green, doing a vkCmdCopyBuffer without a barrier and immediately drawing a single frame afterwards, expecting some of the triangles to still be red. Nope: all of them are green. So I thought: well, since the transfer is faster than the render, it is still possible for these commands to execute in parallel and produce this result. So I did yet another test and changed the color of all triangles but drew only the last one, expecting at least that one to still be red. Nope: it is green again, already updated, meaning the transfer finished without a barrier.
So ... I guess a barrier is not needed? But why? If the transfer and the render do not run at the same time, why and when do I need a buffer barrier at all? And if I do need one in this case and my NVIDIA driver just isn't acting according to the spec, why would they do that? Why would they implement a wait between transfer and render (which the spec does not require) and slow things down? Or is my test case flawed?
Thank you!
Upvotes: 1
Views: 810
Reputation: 474376
I thought all commands pushed to a queue start sequentially but finish in an unspecified order, meaning they are executed in parallel.
Your use of the word "are" is the source of your confusion. They "can be" executed in parallel, but that doesn't mean they will be. The fact that a thing works on a Vulkan implementation (or even all implementations) is not enough evidence to say that your code is correct. This is why validation layers are so important.
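The Khronos validation layer is enabled at instance creation; with its synchronization validation feature also turned on, it would likely have flagged this missing dependency. A minimal sketch:
const char* layers[] = { "VK_LAYER_KHRONOS_validation" };
VkInstanceCreateInfo instanceInfo = {};
instanceInfo.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
instanceInfo.enabledLayerCount = 1;
instanceInfo.ppEnabledLayerNames = layers;
VkInstance instance;
vkCreateInstance (&instanceInfo, nullptr, &instance);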
You do need some kind of synchronization between the transfer and the point where the data is read. I would use an external dependency between the transfer operation and the subpass that reads the data (specifically the vertex input stage of that subpass). It also needs a memory dependency that covers that memory, for the purpose of reading vertex data. Something like the following sketch:
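(A minimal sketch, supplied at render pass creation; the attachment and subpass setup is assumed to exist as before.)
VkSubpassDependency dependency = {};
dependency.srcSubpass = VK_SUBPASS_EXTERNAL; // the copy recorded before the render pass
dependency.dstSubpass = 0; // the subpass that reads the vertex buffer
dependency.srcStageMask = VK_PIPELINE_STAGE_TRANSFER_BIT;
dependency.dstStageMask = VK_PIPELINE_STAGE_VERTEX_INPUT_BIT;
dependency.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT; // make the transfer's writes available
dependency.dstAccessMask = VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT; // ... and visible to vertex attribute reads
VkRenderPassCreateInfo renderPassInfo = {};
renderPassInfo.sType = VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO;
// ... attachments and subpass descriptions as before ...
renderPassInfo.dependencyCount = 1;
renderPassInfo.pDependencies = &dependency;
(Your vkCmdPipelineBarrier from the question accomplishes the same thing; the external dependency just lets the implementation fold the synchronization into the render pass.)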
Indeed, if you haven't set any external dependencies for an attachment, your render pass will generate one automatically. And this is probably what makes your program "work". The implementation probably issues a full barrier between the render pass and any previous commands (even though the automatic external dependency doesn't include the vertex input stage). And the vertex input caches were probably cleared at some point, which prevents you from accidentally seeing old data.
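For reference, the implicit dependency operates as if defined like this (paraphrasing the specification). Note the top-of-pipe source stage and empty source access mask: it waits on nothing and makes no prior writes available, and its destination accesses only cover attachments, not vertex attribute reads:
VkSubpassDependency implicitDependency = {};
implicitDependency.srcSubpass = VK_SUBPASS_EXTERNAL;
implicitDependency.dstSubpass = 0; // the first subpass the attachment is used in
implicitDependency.srcStageMask = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT; // waits for nothing
implicitDependency.dstStageMask = VK_PIPELINE_STAGE_ALL_COMMANDS_BIT;
implicitDependency.srcAccessMask = 0; // makes no writes available
implicitDependency.dstAccessMask = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT |
VK_ACCESS_COLOR_ATTACHMENT_READ_BIT |
VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT |
VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_READ_BIT |
VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT; // attachment accesses only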
However, the actual text of the implicit external dependency does not cover this use case. So if your code is "working", it is only by accident; you still need an explicit dependency.
Your initial code (where you wrote data to host-visible memory that the GPU read directly) worked because submitting a batch implicitly synchronizes host writes globally, for all uses, so long as those host writes happen before the submit operation.
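That pattern looks roughly like this (a sketch; vertices, vertexDataSize, queue and submitInfo are placeholders):
memcpy (hostMemory, vertices, vertexDataSize); // host write through the mapped, HOST_COHERENT pointer
vkQueueSubmit (queue, 1, &submitInfo, VK_NULL_HANDLE); // the submit makes prior host writes visible to the device
// (with non-coherent memory you would additionally need vkFlushMappedMemoryRanges)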
Upvotes: 1