Reputation: 792
Situation: In a Metal kernel function, every thread in a threadgroup reads the exact same value at a time. The kernel pseudocode:
kernel void foo(device int2* ranges,
constant float3& readonlyBuffer,
device float* results,
uint lno [[ threadgroup_position_in_grid ]])
{
float acc = 0.0;
for(int i=ranges[lno].x; i<ranges[lno].y; i++) {
// each thread in threadgroup processes the same value from the buffer
acc += process( readonlyBuffer[i] );
}
results[...] = acc;
}
The problem: in the pursuit of optimizing buffer reads, I changed readonlyBuffer
's address space qualifier from device
to constant
. This had zero impact on the kernel performance although the Apple documentation says something different:
The constant address space is optimized for multiple instances executing a graphics or kernel function accessing the same location in the buffer.
Questions:
Upvotes: 2
Views: 529
Reputation: 6410
In your example code, indexing into readonlyBuffer
would generate a compiler error.
Assuming readonlyBuffer
is declared as a pointer, then the compiler doesn't statically know the size, and can't move the data to the constant memory space.
If readonlyBuffer
is small (you only have 4KB of constant memory to work with), put it into a struct as follows:
struct ReadonlyBuffer {
float3 values[MAX_BUFFER_SIZE];
};
Then do:
kernel void foo(device int2* ranges,
constant ReadonlyBuffer& readonlyBuffer,
device float* results,
uint lno [[ threadgroup_position_in_grid ]])
Finally, run a GPU trace ("Capture GPU Frame") and ensure you don't get the following error:
The Compiler was not able to Preload your Buffer. Kernel Function, Buffer Index: 1.
For more info about Buffer Preloading, see: https://developer.apple.com/videos/play/wwdc2016/606/?time=408
Upvotes: 3