How do you create and use an indirectCommandBuffer in Swift for Metal GPU computations?

Question

I am currently working on a project that uses the GPU to do computations on large datasets. Currently I'm investigating the potential of using indirectCommandBuffers to potentially speed up our code, especially since I'm are having trouble with its speed on the M1 processor (interestingly enough, the speed at which our program runs is super fast on AMD Metal GPUs). Another reason why I want to do this is to avoid having to create the exact same Compute Command Encoders 500+ times.

However, I'm having troubles with coding the indirect Command Buffers, and I can't seem to find much documentation for it online, especially in Swift. When I first attempted to do this in the project I'm working on, I found that on the M1 it would just crash when I tried setting the MTLComputePipelineState of the MTLIndirectComputeCommand using .setComputePipelineState(), whereas on AMD chips it would hang when trying to commit and execute the commands in the indirectCommandBuffer, and if it got through everything, it would just return memory points of 0 data.

I've created what hopefully is a minimal reproducible example to try and show what issue I'm having; it just adds two numpy arrays received from C 1000 times. Be aware this is just an example to illustrate the issue. Our goal is improve some finite difference code with Metal.

I'm currently running MacOS 12.4.

Below is the Swift function:

import Metal
import MetalPerformanceShaders
import Accelerate
import Foundation

@_cdecl("metalswift_add")
public func addition(array1: UnsafeMutablePointer,array2: UnsafeMutablePointer, length: Int) -> UnsafeMutablePointer {

  var bFound = false
  var device : MTLDevice!
  device = MTLCreateSystemDefaultDevice()!
  let defaultLibrary = try! device.makeLibrary(filepath: "metal.metallib")
  let metalswift_addfunction = defaultLibrary.makeFunction(name: "metalswift_add")!
  let descriptor = MTLComputePipelineDescriptor()
  descriptor.computeFunction = metalswift_addfunction
  descriptor.supportIndirectCommandBuffers = true
  let computePipelineState = try! device.makeComputePipelineState(descriptor: descriptor, options: .init(), reflection: nil)
  
  var Ref1 : UnsafeMutablePointer = UnsafeMutablePointer(array1)
  var Ref2 : UnsafeMutablePointer = UnsafeMutablePointer(array2)
  var size = length
  let SizeBuffer : UnsafeMutableRawPointer = UnsafeMutableRawPointer(&size)

  let ll = MemoryLayout.stride * length

  var Buffer1:MTLBuffer! = device.makeBuffer(bytes:Ref1, length: ll, options:[])
  var Buffer2:MTLBuffer! = device.makeBuffer(bytes:Ref2, length: ll, options:[])
  var MetalBuffer:MTLBuffer! = device.makeBuffer(length: ll, options:[])
  let Size:MTLBuffer! = device.makeBuffer(bytes: SizeBuffer, length: MemoryLayout.size, options: [])

  var icbDescriptor:MTLIndirectCommandBufferDescriptor = MTLIndirectCommandBufferDescriptor()
  icbDescriptor.commandTypes.insert(MTLIndirectCommandType.concurrentDispatchThreads)
  icbDescriptor.inheritBuffers = false
  icbDescriptor.inheritPipelineState = false
  icbDescriptor.maxKernelBufferBindCount = 4
  var indirectCommandBuffer = device.makeIndirectCommandBuffer(descriptor: icbDescriptor, maxCommandCount: 1)!

  let icbCommand = indirectCommandBuffer.indirectComputeCommandAt(0)
  icbCommand.setComputePipelineState(computePipelineState)
  icbCommand.setKernelBuffer(Buffer1, offset: 0, at: 0)
  icbCommand.setKernelBuffer(Buffer2, offset: 0, at: 1)
  icbCommand.setKernelBuffer(MetalBuffer, offset: 0, at: 2)
  icbCommand.setKernelBuffer(Size, offset: 0, at: 3)
  icbCommand.concurrentDispatchThreads(MTLSize(width:computePipelineState.threadExecutionWidth, height: 1, depth: 1), threadsPerThreadgroup:MTLSize(width:computePipelineState.maxTotalThreadsPerThreadgroup, height: 1, depth: 1))
  icbCommand.setBarrier()
  
  for i in 0..<1000{
  print(i)
  let commandQueue = device.makeCommandQueue()!
  let commandBuffer = commandQueue.makeCommandBuffer()!
  let computeCommandEncoder = commandBuffer.makeComputeCommandEncoder()!
  computeCommandEncoder.executeCommandsInBuffer(indirectCommandBuffer, range:0..<1)
  computeCommandEncoder.endEncoding()
  commandBuffer.commit()
  commandBuffer.waitUntilCompleted()
  }
  return(MetalBuffer!.contents().assumingMemoryBound(to: Float.self))
}

This is the Metal Function:

#include 
#include 
using namespace metal;

#define size (*size_pr)
kernel void metalswift_add(const device float *Buffer1 [[ buffer(0) ]],
const device float *Buffer2[[ buffer(1) ]],
device float *MetalBuffer[[ buffer(2) ]],
const device int *size_pr[[ buffer(3) ]]) {
    for (int i=0; i


I had it working without the indirectCommandEncoders, so I believe that it's probably an issue with how I coded the indirectCommandEncoders rather than the Metal function.
If any other information is needed, let me know! Sorry if this is of low quality, this is my first question on stack.
Update: I've updated the code above with some changes that stops the code from crashing at runtime. However, I'm still running into the hanging issue on AMD Metal GPUs, and on the M1 it seems like it only goes through the Metal function once.

How do you create and use an indirectCommandBuffer in Swift for Metal GPU computations?

Answers (1)

Related Questions