Reputation: 1750
I have a working metal application that is extremely slow, and needs to run faster. I believe the problem is I am creating too many MTLCommandBuffer objects.
The reason I am creating so many MTLCommandBuffer objects is I need to send different uniform values to the pixel shader. I've pasted a snippit of code to illustrate the problem below.
for (int obj_i = 0 ; obj_i < n ; ++obj_i)
{
// I create one render command buffer per object I draw so I can use different uniforms
id <MTLCommandBuffer> mtlCommandBuffer = [metal_info.g_commandQueue commandBuffer];
id <MTLRenderCommandEncoder> renderCommand = [mtlCommandBuffer renderCommandEncoderWithDescriptor:<#(MTLRenderPassDescriptor *)#>]
// glossing over details, but this call has per object specific data
memcpy([global_uniform_buffer contents], per_object_data, sizeof(per_data_object));
[renderCommand setVertexBuffer:object_vertices offset:0 atIndex:0];
// I am reusing a single buffer for all shader calls
// this is killing performance
[renderCommand setVertexBuffer:global_uniform_buffer offset:0 atIndex:1];
[renderCommand drawIndexedPrimitives:MTLPrimitiveTypeTriangle
indexCount:per_object_index_count
indexType:MTLIndexTypeUInt32
indexBuffer:indicies
indexBufferOffset:0];
[renderCommand endEncoding];
[mtlCommandBuffer presentDrawable:frameDrawable];
[mtlCommandBuffer commit];
}
The above code draw as expected, but is EXTREMELY slow. I'm guessing because there is a better way to force pixel shader evaluation than creating a MTLCommandBuffer per object.
I've consider simple allocating a buffer much larger than is needed for a single shader pass and simply using offset to queue up several calls in one render command encoder then execute them. This method seems pretty unorthodox, and I want to make sure I'm solving the issue of needed to send custom data per object in a Metal friendly way.
What is the fastest way to render using multiple passes of the same pixel/vertex shader with per call custom uniform data?
Upvotes: 3
Views: 2317
Reputation: 1256
Don't reuse the same uniform buffer for every object. Doing that destroys all parallelism between the CPU and GPU and causes regular sync points.
Instead, make a separate uniform buffer for each object you are going to render in the frame. In fact you should really create 2 per object and alternate between them each frame so that the GPU can be rendering the last frame whilst you are preparing the next frame on the CPU.
After you do that, you simply refactor your loop so the command buffer and render command work are done once per frame. Your loop should only consist of copying the uniform data, setting the vertex buffer and calling draw primitive.
Upvotes: 7