Reputation: 163
I saw an opportunity to improve my app's performance by using a Metal compute pipeline. However, my initial testing revealed that the compute pipeline was absurdly slow (at least on older devices).
So I made a sample project to compare the performance of the compute and render pipelines. The program takes a 2048 x 2048 source texture and converts it to grayscale into a destination texture.
On an iPhone 5S, the fragment shader did the conversion in 3 ms, while the compute kernel took 177 ms to do the same thing. That is 59 times longer!
What is your experience with the compute pipeline on older devices? Isn't it absurdly slow?
Here are my fragment and compute functions:
// Rec. 709 luma coefficients used by both functions
constant half3 kRec709Luma = half3(0.2126, 0.7152, 0.0722);

// Grayscale Fragment Function
fragment half4 grayscaleFragment(RasterizerData in [[stage_in]],
                                 texture2d<half> inTexture [[texture(0)]])
{
    constexpr sampler textureSampler;
    half4 inColor = inTexture.sample(textureSampler, in.textureCoordinate);
    half gray = dot(inColor.rgb, kRec709Luma);
    return half4(gray, gray, gray, 1.0);
}
// Grayscale Kernel Function
kernel void grayscaleKernel(uint2 gid [[thread_position_in_grid]],
                            texture2d<half, access::read>  inTexture  [[texture(0)]],
                            texture2d<half, access::write> outTexture [[texture(1)]])
{
    half4 inColor = inTexture.read(gid);
    half gray = dot(inColor.rgb, kRec709Luma);
    outTexture.write(half4(gray, gray, gray, 1.0), gid);
}
Compute and render methods:
- (void)compute {
    id<MTLCommandBuffer> commandBuffer = [_commandQueue commandBuffer];

    // Compute encoder
    id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
    [computeEncoder setComputePipelineState:_computePipelineState];
    [computeEncoder setTexture:_srcTexture atIndex:0];
    [computeEncoder setTexture:_dstTexture atIndex:1];
    [computeEncoder dispatchThreadgroups:_threadgroupCount threadsPerThreadgroup:_threadgroupSize];
    [computeEncoder endEncoding];

    [commandBuffer commit];
    [commandBuffer waitUntilCompleted];
}
- (void)render {
    id<MTLCommandBuffer> commandBuffer = [_commandQueue commandBuffer];

    // Render pass descriptor
    MTLRenderPassDescriptor *renderPassDescriptor = [MTLRenderPassDescriptor renderPassDescriptor];
    renderPassDescriptor.colorAttachments[0].loadAction = MTLLoadActionDontCare;
    renderPassDescriptor.colorAttachments[0].texture = _dstTexture;
    renderPassDescriptor.colorAttachments[0].storeAction = MTLStoreActionStore;

    // Render encoder
    id<MTLRenderCommandEncoder> renderEncoder = [commandBuffer renderCommandEncoderWithDescriptor:renderPassDescriptor];
    [renderEncoder setRenderPipelineState:_renderPipelineState];
    [renderEncoder setFragmentTexture:_srcTexture atIndex:0];
    [renderEncoder drawPrimitives:MTLPrimitiveTypeTriangleStrip vertexStart:0 vertexCount:4];
    [renderEncoder endEncoding];

    [commandBuffer commit];
    [commandBuffer waitUntilCompleted];
}
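For reference, a minimal way to take CPU-side timings like the 3 ms / 177 ms figures is to wrap the blocking calls with CACurrentMediaTime() from QuartzCore. This is only an illustrative sketch (the measureTimings helper name is made up), not necessarily how the numbers above were produced:

#import <QuartzCore/QuartzCore.h>   // CACurrentMediaTime

- (void)measureTimings {
    // Wall-clock timing; waitUntilCompleted inside compute/render guarantees
    // the GPU work has finished before the second timestamp is taken.
    CFTimeInterval start = CACurrentMediaTime();
    [self compute];
    NSLog(@"compute pass: %.1f ms", (CACurrentMediaTime() - start) * 1000.0);

    start = CACurrentMediaTime();
    [self render];
    NSLog(@"render pass: %.1f ms", (CACurrentMediaTime() - start) * 1000.0);
}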
And Metal setup:
- (void)setupMetal
{
    // Get the Metal device
    _device = MTLCreateSystemDefaultDevice();

    // Create the command queue
    _commandQueue = [_device newCommandQueue];

    id<MTLLibrary> defaultLibrary = [_device newDefaultLibrary];

    // Create compute pipeline state
    _computePipelineState = [_device newComputePipelineStateWithFunction:[defaultLibrary newFunctionWithName:@"grayscaleKernel"] error:nil];

    // Create render pipeline state
    MTLRenderPipelineDescriptor *pipelineStateDescriptor = [[MTLRenderPipelineDescriptor alloc] init];
    pipelineStateDescriptor.vertexFunction = [defaultLibrary newFunctionWithName:@"vertexShader"];
    pipelineStateDescriptor.fragmentFunction = [defaultLibrary newFunctionWithName:@"grayscaleFragment"];
    pipelineStateDescriptor.colorAttachments[0].pixelFormat = MTLPixelFormatBGRA8Unorm;
    _renderPipelineState = [_device newRenderPipelineStateWithDescriptor:pipelineStateDescriptor error:nil];
    // Create the source and destination texture descriptor.
    // Since the compute kernel function doesn't check whether pixels are within the bounds of the
    // destination texture, make sure the texture width and height are multiples of threadExecutionWidth
    // and (maxTotalThreadsPerThreadgroup / threadExecutionWidth), respectively.
    MTLTextureDescriptor *textureDescriptor = [MTLTextureDescriptor texture2DDescriptorWithPixelFormat:MTLPixelFormatBGRA8Unorm
                                                                                                 width:2048
                                                                                                height:2048
                                                                                             mipmapped:NO];
    // Create source texture
    textureDescriptor.usage = MTLTextureUsageShaderRead;
    _srcTexture = [_device newTextureWithDescriptor:textureDescriptor];
    // Create destination texture
    textureDescriptor.usage = MTLTextureUsageShaderWrite | MTLTextureUsageRenderTarget;
    _dstTexture = [_device newTextureWithDescriptor:textureDescriptor];
    // Set the compute kernel's threadgroup size
    NSUInteger threadWidth = _computePipelineState.threadExecutionWidth;
    NSUInteger threadMax = _computePipelineState.maxTotalThreadsPerThreadgroup;
    _threadgroupSize = MTLSizeMake(threadWidth, threadMax / threadWidth, 1);

    // Set the compute kernel's threadgroup count
    _threadgroupCount.width = (_srcTexture.width + _threadgroupSize.width - 1) / _threadgroupSize.width;
    _threadgroupCount.height = (_srcTexture.height + _threadgroupSize.height - 1) / _threadgroupSize.height;
    _threadgroupCount.depth = 1;
}
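One thing the setup above leaves out is uploading pixel data into the source texture. A minimal sketch, assuming a tightly packed BGRA8 buffer named pixelData prepared elsewhere (the name is hypothetical):

// Hypothetical upload of CPU-side pixels into _srcTexture; on iOS the texture's
// default storage mode is shared, so replaceRegion can write into it directly.
MTLRegion region = MTLRegionMake2D(0, 0, _srcTexture.width, _srcTexture.height);
[_srcTexture replaceRegion:region
               mipmapLevel:0
                 withBytes:pixelData
               bytesPerRow:4 * _srcTexture.width];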
Upvotes: 4
Views: 1372
Reputation: 4425
The Metal compute pipeline is unusable on A7-class devices, while the same compute pipeline has great performance on A8 and newer devices. Your options for dealing with this are to create fragment shader implementations for A7 devices and use the compute path on all newer devices (a runtime check is sketched below), or to offload the computation to the CPUs on A7 (devices of this class have at least 2 CPU cores). You could also just use fragment shaders for all devices, but much better performance on complex code is possible with compute kernels, so it is something to think about.
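If you go the branching route, one way to tell A7 apart from A8-and-newer at runtime is the Metal feature set query. Sketch only: device is assumed to be your id<MTLDevice>, and MTLFeatureSet_iOS_GPUFamily2_v1 is the first feature set the A8 GPU family reports.

// Use the compute path only on GPU family 2 (A8) and newer; fall back to the
// fragment-shader path on A7, which only reports GPU family 1.
if ([device supportsFeatureSet:MTLFeatureSet_iOS_GPUFamily2_v1]) {
    [self compute];   // compute kernel grayscale
} else {
    [self render];    // fragment shader grayscale
}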
Upvotes: 4