Metal compute pipeline absurdly slow

Question

I saw an opportunity to improve my app's performance by using a Metal compute pipeline. However, my initial testing revealed that the compute pipeline was absurdly slow (at least on older devices).

So I made a sample project to compare the performance of the compute and render pipelines. The program takes a 2048 x 2048 source texture and converts it to grayscale in a destination texture.

On an iPhone 5S, it took 3 ms for the fragment shader to do the conversion. However, it took 177 ms for the compute kernel to do the same thing. That is 59 times longer!

What is your experience with the compute pipeline on older devices? Isn't it absurdly slow?

Here are my fragment and compute functions:

// Grayscale Fragment Function
fragment half4 grayscaleFragment(RasterizerData in [[stage_in]],
                                 texture2d<half> inTexture [[texture(0)]])
{
    constexpr sampler textureSampler;

    half4 inColor  = inTexture.sample(textureSampler, in.textureCoordinate);
    half  gray     = dot(inColor.rgb, kRec709Luma);
    return half4(gray, gray, gray, 1.0);
}


// Grayscale Kernel Function
kernel void grayscaleKernel(uint2 gid [[thread_position_in_grid]],
                            texture2d<half, access::read> inTexture [[texture(0)]],
                            texture2d<half, access::write> outTexture [[texture(1)]])
{
    half4 inColor  = inTexture.read(gid);
    half  gray     = dot(inColor.rgb, kRec709Luma);
    outTexture.write(half4(gray, gray, gray, 1.0), gid);
}
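
For reference, here is a minimal CPU sketch in plain C of what both functions compute per pixel. It assumes kRec709Luma holds the standard Rec. 709 weights (0.2126, 0.7152, 0.0722), since its definition isn't shown above, and that the texture is BGRA8 as in the setup code. The helper name grayscale_bgra8 is hypothetical, and the shaders work in half precision, so results may differ by a rounding step.

```c
#include <stdint.h>

/* Assumed Rec. 709 luma weights; kRec709Luma itself is not shown in the question. */
static const float kLumaR = 0.2126f;
static const float kLumaG = 0.7152f;
static const float kLumaB = 0.0722f;

/* Hypothetical helper: converts one BGRA8 pixel to gray and forces alpha to 255,
   mirroring half4(gray, gray, gray, 1.0) in the shaders. */
static void grayscale_bgra8(const uint8_t in[4], uint8_t out[4])
{
    /* BGRA byte order: in[0] = blue, in[1] = green, in[2] = red. */
    float b = in[0] / 255.0f;
    float g = in[1] / 255.0f;
    float r = in[2] / 255.0f;

    /* Same dot product as dot(inColor.rgb, kRec709Luma). */
    float gray = r * kLumaR + g * kLumaG + b * kLumaB;

    /* Round to the nearest 8-bit value. */
    uint8_t v = (uint8_t)(gray * 255.0f + 0.5f);
    out[0] = v;
    out[1] = v;
    out[2] = v;
    out[3] = 255;
}
```

Running this over every pixel of a known image gives a reference to compare both GPU outputs against, which at least confirms the two pipelines do the same work.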

Compute and render methods

- (void)compute {

    id<MTLCommandBuffer> commandBuffer = [_commandQueue commandBuffer];

    // Compute encoder
    id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
    [computeEncoder setComputePipelineState:_computePipelineState];
    [computeEncoder setTexture:_srcTexture atIndex:0];
    [computeEncoder setTexture:_dstTexture atIndex:1];
    [computeEncoder dispatchThreadgroups:_threadgroupCount threadsPerThreadgroup:_threadgroupSize];
    [computeEncoder endEncoding];

    [commandBuffer commit];

    [commandBuffer waitUntilCompleted];
}


- (void)render {

    id<MTLCommandBuffer> commandBuffer = [_commandQueue commandBuffer];

    // Render pass descriptor
    MTLRenderPassDescriptor *renderPassDescriptor = [MTLRenderPassDescriptor renderPassDescriptor];
    renderPassDescriptor.colorAttachments[0].loadAction = MTLLoadActionDontCare;
    renderPassDescriptor.colorAttachments[0].texture = _dstTexture;
    renderPassDescriptor.colorAttachments[0].storeAction = MTLStoreActionStore;

    // Render encoder
    id<MTLRenderCommandEncoder> renderEncoder = [commandBuffer renderCommandEncoderWithDescriptor:renderPassDescriptor];
    [renderEncoder setRenderPipelineState:_renderPipelineState];
    [renderEncoder setFragmentTexture:_srcTexture atIndex:0];
    [renderEncoder drawPrimitives:MTLPrimitiveTypeTriangleStrip vertexStart:0 vertexCount:4];
    [renderEncoder endEncoding];

    [commandBuffer commit];

    [commandBuffer waitUntilCompleted];
}

And Metal setup:

- (void)setupMetal
{
    // Get metal device
    _device = MTLCreateSystemDefaultDevice();

    // Create the command queue
    _commandQueue = [_device newCommandQueue];

    id<MTLLibrary> defaultLibrary = [_device newDefaultLibrary];

    // Create compute pipeline state
    _computePipelineState = [_device newComputePipelineStateWithFunction:[defaultLibrary newFunctionWithName:@"grayscaleKernel"] error:nil];

    // Create render pipeline state
    MTLRenderPipelineDescriptor *pipelineStateDescriptor = [[MTLRenderPipelineDescriptor alloc] init];
    pipelineStateDescriptor.vertexFunction = [defaultLibrary newFunctionWithName:@"vertexShader"];
    pipelineStateDescriptor.fragmentFunction = [defaultLibrary newFunctionWithName:@"grayscaleFragment"];
    pipelineStateDescriptor.colorAttachments[0].pixelFormat = MTLPixelFormatBGRA8Unorm;
    _renderPipelineState = [_device newRenderPipelineStateWithDescriptor:pipelineStateDescriptor error:nil];

    // Create source and destination texture descriptor
    // Since the compute kernel function doesn't check whether pixels are within the bounds of the destination texture, make sure the texture width
    // and height are multiples of the pipeline's threadExecutionWidth and (maxTotalThreadsPerThreadgroup / threadExecutionWidth) respectively.
    MTLTextureDescriptor *textureDescriptor = [MTLTextureDescriptor texture2DDescriptorWithPixelFormat:MTLPixelFormatBGRA8Unorm
                                                                                                 width:2048
                                                                                                height:2048
                                                                                             mipmapped:NO];
    // Create source texture
    textureDescriptor.usage = MTLTextureUsageShaderRead;
    _srcTexture = [_device newTextureWithDescriptor:textureDescriptor];

    // Create destination texture
    textureDescriptor.usage = MTLTextureUsageShaderWrite | MTLTextureUsageRenderTarget;
    _dstTexture = [_device newTextureWithDescriptor:textureDescriptor];

    // Set the compute kernel's threadgroup size
    NSUInteger threadWidth = _computePipelineState.threadExecutionWidth;
    NSUInteger threadMax = _computePipelineState.maxTotalThreadsPerThreadgroup;
    _threadgroupSize = MTLSizeMake(threadWidth, threadMax / threadWidth, 1);

    // Set the compute kernel's threadgroup count
    _threadgroupCount.width  = (_srcTexture.width  + _threadgroupSize.width -  1) / _threadgroupSize.width;
    _threadgroupCount.height = (_srcTexture.height + _threadgroupSize.height - 1) / _threadgroupSize.height;
    _threadgroupCount.depth = 1;
}
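
The threadgroup arithmetic at the end of the setup can be checked off-device with a small plain C sketch. Size3 is a hypothetical stand-in for MTLSize, and the execution width of 32 and limit of 512 threads used in the note below are illustrative assumptions, not values measured on the iPhone 5S.

```c
#include <stddef.h>

/* Hypothetical stand-in for MTLSize so the arithmetic can run without Metal. */
typedef struct {
    size_t width;
    size_t height;
    size_t depth;
} Size3;

/* Same shape as the setup code: one execution width wide,
   and as tall as the per-threadgroup thread limit allows. */
static Size3 threadgroup_size(size_t executionWidth, size_t maxThreads)
{
    Size3 s = { executionWidth, maxThreads / executionWidth, 1 };
    return s;
}

/* Ceiling division, so a partial threadgroup still covers the texture edge. */
static Size3 threadgroup_count(Size3 texture, Size3 group)
{
    Size3 c = {
        (texture.width  + group.width  - 1) / group.width,
        (texture.height + group.height - 1) / group.height,
        1
    };
    return c;
}
```

With the assumed values of 32 and 512, the threadgroup is 32 x 16 and a 2048 x 2048 texture needs 64 x 128 threadgroups with no overshoot. A 2050 x 2050 texture would need 65 x 129 threadgroups covering 2080 x 2064 threads, which is where the missing bounds check in the kernel would start to matter.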
