Caius Cosades

Reputation: 81

Vulkan compute shaders: Most efficient way to transfer buffer to/from GPU? Retrieving the buffer seems to be slow

I am trying to use Vulkan now for some GPGPU work and am struggling to get decent performance on my GTX 1070.

Let's say I want to run a compute shader against a block of data and write the result into the same block.

To transfer the data to the GPU, I first create a host-visible staging buffer and a device local buffer, then I map the staging buffer's memory to the host's address space, copy the data to it, unmap the staging buffer, and copy the staging buffer to the device buffer.

  VkBuffer stagingBuffer;
  VkDeviceMemory stagingBufferMemory;
  createBuffer(size, VK_BUFFER_USAGE_TRANSFER_SRC_BIT,
    VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT,
    stagingBuffer, stagingBufferMemory);

  void* bufferData = nullptr;
  vkMapMemory(m_device, stagingBufferMemory, 0, size, 0, &bufferData);
  memcpy(bufferData, data, size);
  vkUnmapMemory(m_device, stagingBufferMemory);

  VkBufferUsageFlags usage = VK_BUFFER_USAGE_TRANSFER_DST_BIT
                           | VK_BUFFER_USAGE_TRANSFER_SRC_BIT
                           | VK_BUFFER_USAGE_STORAGE_BUFFER_BIT;
  createBuffer(size, usage, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, m_buffer, m_bufferMemory);

  copyBuffer(stagingBuffer, m_buffer, size);

  vkDestroyBuffer(m_device, stagingBuffer, nullptr);
  vkFreeMemory(m_device, stagingBufferMemory, nullptr);

I then run the shader. To retrieve the result, I do as before but in reverse: Create a host-visible staging buffer, copy the device-local buffer to it, map the staging buffer memory into the host's address space, and memcpy it to where I need it.

  VkBuffer stagingBuffer;
  VkDeviceMemory stagingBufferMemory;
  createBuffer(m_bufferSize, VK_BUFFER_USAGE_TRANSFER_DST_BIT,
    VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT,
    stagingBuffer, stagingBufferMemory);

  copyBuffer(m_buffer, stagingBuffer, m_bufferSize);

  void* bufferData = nullptr;
  vkMapMemory(m_device, stagingBufferMemory, 0, m_bufferSize, 0, &bufferData);
  memcpy(data, bufferData, m_bufferSize);
  vkUnmapMemory(m_device, stagingBufferMemory);

  vkDestroyBuffer(m_device, stagingBuffer, nullptr);
  vkFreeMemory(m_device, stagingBufferMemory, nullptr);

Is there a faster way of doing this? Currently, the running time (as measured on the host side with the standard library's high resolution clock) is dominated by pulling the buffer back down to the host - so much so that it's considerably slower than if I'd done everything on the CPU.

Here are some example figures. This was running against a large buffer, which makes the difference quite extreme:

Running CPU benchmark...
Running time: 20.875 milliseconds

Running GPU benchmark...
Submit time = 41.817
Execution time = 1.066
Retrieval time = 223.479
Running time: 266.391 milliseconds

So to summarize the overall procedure at the moment:

1. Create a host-visible staging buffer, map it, memcpy the input data into it, and unmap it.
2. Copy the staging buffer to a device-local buffer.
3. Run the compute shader.
4. Copy the device-local buffer back into a second host-visible staging buffer.
5. Map that staging buffer and memcpy the result back to the host.

What can be done to speed this up?

Upvotes: 2

Views: 1532

Answers (1)

Nicol Bolas

Reputation: 474186

Your code shows a lot of cruft and bad practices.

Your code treats device memory as though it's a CPU heap, allocating and deallocating it for each operation. It's not. You should allocate device memory once and keep it around for reuse. If you later need a larger allocation, you may have to make a new one (or you should have allocated a larger buffer up front), but overall, you should never allocate memory just for a single operation.

Also, never unmap memory unless you're about to delete it. There is no disadvantage to keeping host-visible memory mapped, and mapping it is not a free operation.
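
For example, here's a minimal sketch of that approach built on your createBuffer helper: create one staging buffer up front, map it once, and reuse the mapping for every transfer (the m_staging* names are placeholders, not your actual members):

  VkBuffer m_stagingBuffer;
  VkDeviceMemory m_stagingMemory;
  void* m_stagingMapped = nullptr;

  void createPersistentStaging(VkDeviceSize size)
  {
    // Usable both as the source of uploads and the destination of readbacks.
    createBuffer(size,
      VK_BUFFER_USAGE_TRANSFER_SRC_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT,
      VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT,
      m_stagingBuffer, m_stagingMemory);

    // Map once; the pointer stays valid until the memory is unmapped or freed.
    vkMapMemory(m_device, m_stagingMemory, 0, size, 0, &m_stagingMapped);
  }

  void upload(const void* data, VkDeviceSize size)
  {
    // Host-coherent memory: a plain memcpy suffices, with no flush and no
    // map/unmap per transfer.
    memcpy(m_stagingMapped, data, size);
  }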

But the most important issue is what your code implies about its relationship to Vulkan devices. I don't see any direct usage of command buffers, so I must assume that functions like copyBuffer create a command buffer (CB) and write a transfer operation to it. But I also don't see any synchronization operations or direct queue usage, which means that such functions must also submit the CBs to the queue and wait until the queue operation has finished.
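
Presumably something like this hypothetical reconstruction (beginSingleTimeCommands, m_queue, and the exact signature are assumptions, not your actual code):

  void copyBuffer(VkBuffer src, VkBuffer dst, VkDeviceSize size)
  {
    // Assumed helper that allocates and begins a one-shot command buffer.
    VkCommandBuffer cmd = beginSingleTimeCommands();

    VkBufferCopy region = { 0, 0, size };  // srcOffset, dstOffset, size
    vkCmdCopyBuffer(cmd, src, dst, 1, &region);
    vkEndCommandBuffer(cmd);

    VkSubmitInfo submitInfo = {};
    submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submitInfo.commandBufferCount = 1;
    submitInfo.pCommandBuffers = &cmd;
    vkQueueSubmit(m_queue, 1, &submitInfo, VK_NULL_HANDLE);

    // A full queue stall, paid on every single copy.
    vkQueueWaitIdle(m_queue);
  }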

Never do this.

Queue submission operations are not cheap. This fact is so important that the Vulkan specification actually takes time out in the docs for vkQueueSubmit to say that. It's literally the first thing in the section on submitting command buffers:

Submission can be a high overhead operation, and applications should attempt to batch work together into as few calls to vkQueueSubmit or vkQueueSubmit2 as possible.

So make sure to do that. If you have designed your Vulkan interfacing API so that this isn't possible, then you've made a mistake at the design level and need to readjust things.
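
In your case, that means recording the upload copy, the dispatch, and the readback copy into a single command buffer, with barriers between them, and submitting it exactly once. A rough sketch, assuming the persistent staging buffer from above plus a command buffer, pipeline, descriptor set, and fence created elsewhere:

  VkCommandBufferBeginInfo beginInfo = {};
  beginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
  beginInfo.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
  vkBeginCommandBuffer(cmd, &beginInfo);

  // Upload: staging -> device-local.
  VkBufferCopy region = { 0, 0, size };
  vkCmdCopyBuffer(cmd, m_stagingBuffer, m_buffer, 1, &region);

  // Make the transfer write visible to the compute shader.
  VkBufferMemoryBarrier toShader = {};
  toShader.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
  toShader.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
  toShader.dstAccessMask = VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_SHADER_WRITE_BIT;
  toShader.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
  toShader.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
  toShader.buffer = m_buffer;
  toShader.offset = 0;
  toShader.size = VK_WHOLE_SIZE;
  vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_TRANSFER_BIT,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, 0, 0, nullptr, 1, &toShader, 0, nullptr);

  // The compute work itself.
  vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, m_pipeline);
  vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, m_pipelineLayout,
    0, 1, &m_descriptorSet, 0, nullptr);
  vkCmdDispatch(cmd, groupCountX, 1, 1);

  // Make the shader writes visible to the readback transfer.
  VkBufferMemoryBarrier toTransfer = toShader;
  toTransfer.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
  toTransfer.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;
  vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
    VK_PIPELINE_STAGE_TRANSFER_BIT, 0, 0, nullptr, 1, &toTransfer, 0, nullptr);

  // Readback: device-local -> staging.
  vkCmdCopyBuffer(cmd, m_buffer, m_stagingBuffer, 1, &region);
  vkEndCommandBuffer(cmd);

  // One submission for the entire round trip; the fence signals on completion.
  VkSubmitInfo submitInfo = {};
  submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
  submitInfo.commandBufferCount = 1;
  submitInfo.pCommandBuffers = &cmd;
  vkQueueSubmit(m_queue, 1, &submitInfo, m_fence);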

Also, never wait for a queue to idle (or worse, the device). Go do something else while the GPU processes the various operations and come back once it's finished (verify this by testing the fence you used when you submitted the commands). Don't wait on the fence unless you have no other operations to do.
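
A sketch of that pattern, polling the fence from the submission above (doOtherCpuWork is a hypothetical stand-in for whatever useful host work you have):

  // Poll instead of blocking: keep the CPU busy until the GPU signals the fence.
  while (vkGetFenceStatus(m_device, m_fence) == VK_NOT_READY) {
    doOtherCpuWork();
  }

  // If there truly is nothing else to do, block on the fence as a last resort.
  vkWaitForFences(m_device, 1, &m_fence, VK_TRUE, UINT64_MAX);
  vkResetFences(m_device, 1, &m_fence);

  // The fence signal makes the GPU's writes visible to the host, so the result
  // can now be read straight out of the persistently mapped staging memory.
  memcpy(data, m_stagingMapped, size);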

Upvotes: 2
