Vulkan compute shaders: Most efficient way to tranfer buffer to/from GPU? Retrieving the buffer seems to be slow

Question

I am trying to use Vulkan now for some GPGPU work and am struggling to get decent performance on my GTX 1070.

Let's say I want to run a compute shader against a block of data and write the result into the same block.

To transfer the data to the GPU, I first create a host-visible staging buffer and a device local buffer, then I map the staging buffer's memory to the host's address space, copy the data to it, unmap the staging buffer, and copy the staging buffer to the device buffer.

  VkBuffer stagingBuffer;
  VkDeviceMemory stagingBufferMemory;
  createBuffer(size, VK_BUFFER_USAGE_TRANSFER_SRC_BIT,
    VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT,
    stagingBuffer, stagingBufferMemory);

  void* bufferData = nullptr;
  vkMapMemory(m_device, stagingBufferMemory, 0, size, 0, &bufferData);
  memcpy(bufferData, data, size);
  vkUnmapMemory(m_device, stagingBufferMemory);

  VkBufferUsageFlags usage = VK_BUFFER_USAGE_TRANSFER_DST_BIT
                           | VK_BUFFER_USAGE_TRANSFER_SRC_BIT
                           | VK_BUFFER_USAGE_STORAGE_BUFFER_BIT;
  createBuffer(size, usage, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, m_buffer, m_bufferMemory);

  copyBuffer(stagingBuffer, m_buffer, size);

  vkDestroyBuffer(m_device, stagingBuffer, nullptr);
  vkFreeMemory(m_device, stagingBufferMemory, nullptr);

I then run the shader. To retrieve the result, I do as before but in reverse: Create a host-visible staging buffer, copy the device-local buffer to it, map the staging buffer memory into the host's address space, and memcpy it to where I need it.

  VkBuffer stagingBuffer;
  VkDeviceMemory stagingBufferMemory;
  createBuffer(m_bufferSize, VK_BUFFER_USAGE_TRANSFER_DST_BIT,
    VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT,
    stagingBuffer, stagingBufferMemory);

  copyBuffer(m_buffer, stagingBuffer, m_bufferSize);

  void* bufferData = nullptr;
  vkMapMemory(m_device, stagingBufferMemory, 0, m_bufferSize, 0, &bufferData);
  memcpy(data, bufferData, m_bufferSize);
  vkUnmapMemory(m_device, stagingBufferMemory);

  vkDestroyBuffer(m_device, stagingBuffer, nullptr);
  vkFreeMemory(m_device, stagingBufferMemory, nullptr);

Is there a faster way of doing this? Currently, the running time (as measured on the host side with the standard library's high resolution clock) is dominated by pulling the buffer back down to the host - so much so that it's considerably slower than if I'd done everything on the CPU.

Here are some example figures. This was running against a large buffer, which makes the difference quite extreme:

Running CPU benchmark...
Running time: 20.875 milliseconds

Running GPU benchmark...
Submit time = 41.817
Execution time = 1.066
Retrieval time = 223.479
Running time: 266.391 milliseconds

So to summarize the overall procedure at the moment:

Create a host-visible staging buffer and a device-local buffer
Map the staging buffer to the host's address space
Copy data into the staging buffer
Unmap the staging buffer from the host's address space
Copy the staging buffer to the device-local buffer with vkCmdCopyBuffer
Delete the staging buffer and its VkDeviceMemory
Executing the shader by recording and submitting a command buffer that binds the relevant pipeline and descriptor sets.
Create a staging buffer again (I've tried reusing the old one instead of making a new one, and it make only a small difference)
Copy device-local buffer to staging buffer
Map staging buffer to host's address space
Copy data out of staging buffer
Unmap the staging buffer
Delete the staging buffer and its VkDeviceMemory

What can be done to speed this up?

Vulkan compute shaders: Most efficient way to tranfer buffer to/from GPU? Retrieving the buffer seems to be slow

Answers (1)

Related Questions