Zoltán
Zoltán

Reputation: 708

Multi threaded rendering (command buffer generation) in Vulkan is slower than single threaded

I am trying to implement multi threaded command buffer generation (using per-thread command pool and secondary command buffers), but there are little performance gain of using multiple threads.

First, I thought that my thread pool code was incorrectly written, but I tried Sascha Willems's thread pool implementation, and nothing changed (so I don't think that's an issue)

Second, I searched for multi threading performance issues and I found that accessing same variables/resources from different thread causes performance drop, but still i can't figure out the problem in my case.

I also downloaded Sascha Willems's multi threading code, run it, and it worked just fine. I modified the number of working threads and the performance gain using multiple threads is clearly visible.

Here are some FPS results for rendering 600 objects (same model). You can see what my problem is:

core count      Sascha Willems's        my result           my result (avg. FPS)
              result ( avg. FPS)       (avg. FPS)        validation layer disabled

    1               45                      30                      55
    2               83                      33                      72
    4               110                     40                      84
    6               155                     42                      103
    8               162                     42                      104
    10              173                     40                      111
    12              175                     40                      119

This is where i prepare the thread data

void prepareThreadData
{
primaryCommandPool = m_device.createCommandPool (
    vk::CommandPoolCreateInfo (
        vk::CommandPoolCreateFlags(vk::CommandPoolCreateFlagBits::eResetCommandBuffer),
        graphicsQueueIdx
    )
);

primaryCommandBuffer = m_device.allocateCommandBuffers (
    vk::CommandBufferAllocateInfo (
        primaryCommandPool,
        vk::CommandBufferLevel::ePrimary,
        1
    )
)[0];

threadData.resize(numberOfThreads);

for (int i = 0; i < numberOfThreads; ++i)
{
    threadData[i].commandPool = m_device.createCommandPool (
        vk::CommandPoolCreateInfo (
            vk::CommandPoolCreateFlags(vk::CommandPoolCreateFlagBits::eResetCommandBuffer),
            graphicsQueueIdx
        )
    );

    threadData[i].commandBuffer = m_device.allocateCommandBuffers (
        vk::CommandBufferAllocateInfo (
            threadData[i].commandPool,
            vk::CommandBufferLevel::eSecondary,
            numberOfObjectsPerThread
        )
    );

    for (int j = 0; j < numberOfObjectsPerThread; ++j)
    {
        VertexPushConstant pushConstant = { someRandomPosition()};
        threadData[i].pushConstBlock.push_back(pushConstant);
    }
}
}

Here is my render loop code where i give job for each thread:

while (!display.IsWindowClosed())
{
display.PollEvents();

m_device.acquireNextImageKHR(m_swapChain, std::numeric_limits<uint64_t>::max(), presentCompleteSemaphore, nullptr, &currentBuffer);

primaryCommandBuffer.begin(vk::CommandBufferBeginInfo());
primaryCommandBuffer.beginRenderPass(
    vk::RenderPassBeginInfo(m_renderPass, m_swapChainBuffers[currentBuffer].frameBuffer, m_renderArea, clearValues.size(), clearValues.data()),
    vk::SubpassContents::eSecondaryCommandBuffers);

vk::CommandBufferInheritanceInfo inheritanceInfo = {};
inheritanceInfo.renderPass = m_renderPass;
inheritanceInfo.framebuffer = m_swapChainBuffers[currentBuffer].frameBuffer;

for (int t = 0; t < numberOfThreads; ++t)
{
    for (int i = 0; i < numberOfObjectsPerThread; ++i)
    {
        threadPool.threads[t]->addJob([=]
        {
            std::array<vk::DeviceSize, 1> offsets = { 0 };
            vk::Viewport viewport = vk::Viewport(0.0f, 0.0f, WIDTH, HEIGHT, 0.0f, 1.0f);
            vk::Rect2D renderArea = vk::Rect2D(vk::Offset2D(), vk::Extent2D(WIDTH, HEIGHT));

            threadData[t].commandBuffer[i].begin(vk::CommandBufferBeginInfo(vk::CommandBufferUsageFlagBits::eRenderPassContinue, &inheritanceInfo));
            threadData[t].commandBuffer[i].setViewport(0, viewport);
            threadData[t].commandBuffer[i].setScissor(0, renderArea);
            threadData[t].commandBuffer[i].bindPipeline(vk::PipelineBindPoint::eGraphics, m_graphicsPipeline);
            threadData[t].commandBuffer[i].bindVertexBuffers(VERTEX_BUFFER_BIND, 1, &model.vertexBuffer, offsets.data());
            threadData[t].commandBuffer[i].bindIndexBuffer(model.indexBuffer, 0, vk::IndexType::eUint32);
            threadData[t].commandBuffer[i].pushConstants(pipelineLayout, vk::ShaderStageFlagBits::eVertex, 0, sizeof(VertexPushConstant), &threadData[t].pushConstBlock[i]);
            threadData[t].commandBuffer[i].drawIndexed(model.indexCount, 1, 0, 0, 0);
            threadData[t].commandBuffer[i].end();
        });
    }
}

threadPool.wait();

std::vector<vk::CommandBuffer> commandBuffers;
for (int t = 0; t < numberOfThreads; ++t)
{
    for (int i = 0; i < numberOfObjectsPerThread; ++i)
    {
        commandBuffers.push_back(threadData[t].commandBuffer[i]);
    }
}

primaryCommandBuffer.executeCommands(commandBuffers.size(), commandBuffers.data());
primaryCommandBuffer.endRenderPass();
primaryCommandBuffer.end();

submitQueue(presentCompleteSemaphore, primaryCommandBuffer);
}

If you have any idea on what am I missing / what i'm doing wrong, please let me know.

Here is the full VS 2017 project if anyone wants to play with it :D

I know it's a MESS, but I'm just learning Vulkan.

Upvotes: 1

Views: 1425

Answers (1)

Zolt&#225;n
Zolt&#225;n

Reputation: 708

It seems that I've found the problem: I have left the validation layers enabled. I disabled it, and the performance increased a lot, I've updated the table in the question with a 4th row for comparison. Who knew that validation layers eat up so much run time. If anyone wants to measure Vulkan's performance, don't forget to disable it!

Upvotes: 5

Related Questions