Reputation: 3587
This question has two parts, but they are closely related:
Question 1: Does Metal provide a way to make use of shared threadgroup memory?
For example, in CUDA you can explicitly load data from device memory into shared memory like this:
__shared__ float example1[64];                // on-chip shared storage
example1[threadIdx.x] = input[threadIdx.x];   // explicit copy from device memory ("input" is a stand-in device pointer)
Does Metal provide such functionality? It appears that all buffer accesses load from global memory, unless there's some hidden magic going on behind the scenes.
This might not be unique to Metal, so any GPU guru could probably help. Apple provides a matrix multiplication example here - I'll paste the kernel below for reference:
typedef struct
{
    ushort m, k, n, pbytes, qbytes;
} MetalMatrixDim;

kernel void MatrixMultiply(const device float*      A    [[ buffer(0) ]],
                           const device float*      B    [[ buffer(1) ]],
                           device float*            C    [[ buffer(2) ]],
                           constant MetalMatrixDim& dims [[ buffer(3) ]],
                           ushort2                  gid  [[ thread_position_in_grid ]])
{
    ushort m = dims.m;
    ushort k = dims.k;
    ushort n = dims.n;
    ushort pbytes = dims.pbytes;
    ushort qbytes = dims.qbytes;

    // Each thread covers an 8 x 8 block of C, hence the << 3.
    ushort2 gidIn = ushort2(gid.x << 3, gid.y << 3);
    if (gidIn.x >= m || gidIn.y >= k) return;

    const device float4* a = (const device float4*)(A + gidIn.x);
    const device float4* b = (const device float4*)(B + gidIn.y);
    C = (device float*)((device char*)C + gidIn.x*qbytes);
    device float4* c = (device float4*)(C + gidIn.y);
    const device float4* Bend = (const device float4*)((const device char*)B + qbytes*n);

    // Sixteen float4 accumulators hold the thread's 8 x 8 output block.
    float4 s0  = 0.0f, s1  = 0.0f, s2  = 0.0f, s3  = 0.0f;
    float4 s4  = 0.0f, s5  = 0.0f, s6  = 0.0f, s7  = 0.0f;
    float4 s8  = 0.0f, s9  = 0.0f, s10 = 0.0f, s11 = 0.0f;
    float4 s12 = 0.0f, s13 = 0.0f, s14 = 0.0f, s15 = 0.0f;

    do
    {
        // Two float4 loads per operand: an 8-wide slice of A and of B.
        float4 aCurr0 = a[0];
        float4 aCurr1 = a[1];
        float4 bCurr0 = b[0];
        float4 bCurr1 = b[1];

        s0  += (aCurr0.x * bCurr0);
        s2  += (aCurr0.y * bCurr0);
        s4  += (aCurr0.z * bCurr0);
        s6  += (aCurr0.w * bCurr0);
        s1  += (aCurr0.x * bCurr1);
        s3  += (aCurr0.y * bCurr1);
        s5  += (aCurr0.z * bCurr1);
        s7  += (aCurr0.w * bCurr1);
        s8  += (aCurr1.x * bCurr0);
        s10 += (aCurr1.y * bCurr0);
        s12 += (aCurr1.z * bCurr0);
        s14 += (aCurr1.w * bCurr0);
        s9  += (aCurr1.x * bCurr1);
        s11 += (aCurr1.y * bCurr1);
        s13 += (aCurr1.z * bCurr1);
        s15 += (aCurr1.w * bCurr1);

        // Advance by one row (pbytes/qbytes are the row strides in bytes).
        a = (device float4*)((device char*)a + pbytes);
        b = (device float4*)((device char*)b + qbytes);
    } while (b < Bend);

    // Write the 8 x 8 block back, one row of C at a time.
    c[0] = s0;  c[1] = s1;  c = (device float4*)((device char*)c + qbytes);
    c[0] = s2;  c[1] = s3;  c = (device float4*)((device char*)c + qbytes);
    c[0] = s4;  c[1] = s5;  c = (device float4*)((device char*)c + qbytes);
    c[0] = s6;  c[1] = s7;  c = (device float4*)((device char*)c + qbytes);
    c[0] = s8;  c[1] = s9;  c = (device float4*)((device char*)c + qbytes);
    c[0] = s10; c[1] = s11; c = (device float4*)((device char*)c + qbytes);
    c[0] = s12; c[1] = s13; c = (device float4*)((device char*)c + qbytes);
    c[0] = s14; c[1] = s15;
}
Question 2: For each thread, this kernel computes an 8 x 8 sector of the output C. What is the reason for this? Why not allow each thread to compute a single element of C, which would remove the multiple-of-8 sizing restriction and provide better parallelization for smaller matrices?
I assume that this implementation must be optimized somehow, and I'm guessing it has to do with thread synchronization and memory access - which is why I've bundled it with Question 1. Any ideas?
Upvotes: 0
Views: 2250
Reputation: 19
It's worth adding that when you know at compile time how long an array you need in a kernel, you can allocate it inside the kernel with

threadgroup float example2[50];

When you don't know the length in advance and only the host can decide the threadgroup memory size, you use the argument-based approach,

kernel void my_func(...,
                    threadgroup float *example2 [[ threadgroup(0) ]],
                    ...)

together with setThreadgroupMemoryLength:atIndex: on the compute command encoder.
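For concreteness, here is a minimal sketch of a kernel that consumes such a host-sized threadgroup buffer. The kernel name sumReduce and the buffer names are my own, and it assumes the host set the length at index 0 to threadsPerThreadgroup * sizeof(float) and dispatched a power-of-two threadgroup size:

#include <metal_stdlib>
using namespace metal;

kernel void sumReduce(const device float *input       [[ buffer(0) ]],
                      device float       *partialSums [[ buffer(1) ]],
                      threadgroup float  *scratch     [[ threadgroup(0) ]],  // sized by the host
                      uint gid   [[ thread_position_in_grid ]],
                      uint lid   [[ thread_position_in_threadgroup ]],
                      uint lsize [[ threads_per_threadgroup ]],
                      uint tgid  [[ threadgroup_position_in_grid ]])
{
    // Stage one element per thread from device memory into threadgroup memory.
    scratch[lid] = input[gid];
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Tree reduction in threadgroup memory (assumes lsize is a power of two).
    for (uint stride = lsize / 2; stride > 0; stride /= 2)
    {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }

    // One partial sum per threadgroup goes back to device memory.
    if (lid == 0)
        partialSums[tgid] = scratch[0];
}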
Upvotes: 0
Reputation: 90671
I don't see a relationship between your two questions. Regarding question 1: yes, Metal provides for shared threadgroup memory in compute functions. Just specify the threadgroup
address space qualifier on the variable declaration. For example:
threadgroup float example1;
You can also specify threadgroup buffers as input arguments to a compute function.
kernel void my_func(...,
                    threadgroup float *example2 [[ threadgroup(0) ]],
                    ...)
{
    ...
}
The buffer is allocated by the device. The size of the buffer is set using the -setThreadgroupMemoryLength:atIndex:
method of the compute command encoder.
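To connect this back to the CUDA snippet in the question, here is a minimal sketch (hypothetical kernel and buffer names, assuming a threadgroup width of 64) that explicitly stages data from device memory into a statically sized threadgroup array before using it:

#include <metal_stdlib>
using namespace metal;

kernel void stageTile(const device float *input  [[ buffer(0) ]],
                      device float       *output [[ buffer(1) ]],
                      uint gid [[ thread_position_in_grid ]],
                      uint lid [[ thread_position_in_threadgroup ]])
{
    // Statically sized threadgroup (shared) array, the analog of __shared__.
    threadgroup float tile[64];

    // Explicit load from device memory into threadgroup memory.
    tile[lid] = input[gid];

    // Wait until every thread in the group has written its element.
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // The whole tile is now visible to every thread in the group;
    // reading a mirrored neighbor is just a trivial demonstration.
    output[gid] = tile[63 - lid];
}

Here threadgroup_barrier(mem_flags::mem_threadgroup) plays the role of CUDA's __syncthreads(), guaranteeing the staged data is visible to every thread in the group before any thread reads it.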
Upvotes: 4