Hundley

Reputation: 3587

Metal - optimizing memory access

This question has two parts, but they are closely related:

Question 1

Does Metal provide a way to make use of shared threadgroup memory?

For example, in CUDA you can explicitly load data from device memory into shared memory like this:

__shared__ float example1;

Does Metal provide such functionality? It appears that all buffer accesses load from global memory, unless there's some hidden magic going on behind the scenes.

Question 2

This might not be unique to Metal, so any GPU guru could probably help. Apple provides a matrix multiplication example here - I'll paste the kernel below for reference:

typedef struct
{
    ushort m, k, n, pbytes, qbytes;
} MetalMatrixDim;


kernel void MatrixMultiply(const device float*       A    [[ buffer(0) ]],
                           const device float*       B    [[ buffer(1) ]],
                           device float*             C    [[ buffer(2) ]],
                           constant MetalMatrixDim&  dims [[ buffer(3) ]],
                           ushort2                   gid  [[ thread_position_in_grid ]])
{
    ushort m = dims.m;
    ushort k = dims.k;
    ushort n = dims.n;

    ushort pbytes = dims.pbytes;
    ushort qbytes = dims.qbytes;

    ushort2 gidIn = ushort2(gid.x << 3, gid.y << 3);

    if (gidIn.x >= m || gidIn.y >= k) return;

    const device float4* a = (const device float4*)(A + gidIn.x);
    const device float4* b = (const device float4*)(B + gidIn.y);

    C = (device float*)((device char*)C + gidIn.x*qbytes);

    device float4* c = (device float4*)(C + gidIn.y);

    const device float4* Bend = (const device float4*)((const device char*)B + qbytes*n);

    float4 s0  = 0.0f, s1  = 0.0f, s2  = 0.0f, s3  = 0.0f;
    float4 s4  = 0.0f, s5  = 0.0f, s6  = 0.0f, s7  = 0.0f;
    float4 s8  = 0.0f, s9  = 0.0f, s10 = 0.0f, s11 = 0.0f;
    float4 s12 = 0.0f, s13 = 0.0f, s14 = 0.0f, s15 = 0.0f;

    do
    {
        float4 aCurr0 = a[0];
        float4 aCurr1 = a[1];
        float4 bCurr0 = b[0];
        float4 bCurr1 = b[1];

        s0   += (aCurr0.x * bCurr0);
        s2   += (aCurr0.y * bCurr0);
        s4   += (aCurr0.z * bCurr0);
        s6   += (aCurr0.w * bCurr0);

        s1   += (aCurr0.x * bCurr1);
        s3   += (aCurr0.y * bCurr1);
        s5   += (aCurr0.z * bCurr1);
        s7   += (aCurr0.w * bCurr1);

        s8   += (aCurr1.x * bCurr0);
        s10  += (aCurr1.y * bCurr0);
        s12  += (aCurr1.z * bCurr0);
        s14  += (aCurr1.w * bCurr0);

        s9   += (aCurr1.x * bCurr1);
        s11  += (aCurr1.y * bCurr1);
        s13  += (aCurr1.z * bCurr1);
        s15  += (aCurr1.w * bCurr1);

        a = (device float4*)((device char*)a + pbytes);
        b = (device float4*)((device char*)b + qbytes);

    } while(b < Bend);

    c[0] = s0;  c[1] = s1;  c = (device float4*)((device char*)c + qbytes);
    c[0] = s2;  c[1] = s3;  c = (device float4*)((device char*)c + qbytes);
    c[0] = s4;  c[1] = s5;  c = (device float4*)((device char*)c + qbytes);
    c[0] = s6;  c[1] = s7;  c = (device float4*)((device char*)c + qbytes);
    c[0] = s8;  c[1] = s9;  c = (device float4*)((device char*)c + qbytes);
    c[0] = s10; c[1] = s11; c = (device float4*)((device char*)c + qbytes);
    c[0] = s12; c[1] = s13; c = (device float4*)((device char*)c + qbytes);
    c[0] = s14; c[1] = s15;
}

The question: for each thread, this kernel computes an 8 × 8 block of the output C. What is the reason for this? Why not have each thread compute a single element of C, which would remove the multiple-of-8 sizing restriction and provide better parallelization for smaller matrices?
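
For concreteness, here is roughly the one-element-per-thread version I have in mind (an untested sketch; it assumes plain row-major layouts and its own rows/cols/inner dimension struct rather than Apple's MetalMatrixDim strides):

typedef struct
{
    uint rows;   // rows of A and C
    uint cols;   // columns of B and C
    uint inner;  // columns of A == rows of B
} NaiveMatrixDim;

kernel void MatrixMultiplyNaive(const device float*      A    [[ buffer(0) ]],
                                const device float*      B    [[ buffer(1) ]],
                                device float*            C    [[ buffer(2) ]],
                                constant NaiveMatrixDim& dims [[ buffer(3) ]],
                                uint2                    gid  [[ thread_position_in_grid ]])
{
    // One thread per output element; gid.x selects the row, gid.y the column.
    if (gid.x >= dims.rows || gid.y >= dims.cols) return;

    float sum = 0.0f;
    for (uint i = 0; i < dims.inner; ++i)
    {
        sum += A[gid.x * dims.inner + i] * B[i * dims.cols + gid.y];
    }
    C[gid.x * dims.cols + gid.y] = sum;
}

Note that in this version every read of A and B goes straight to device memory, and each element ends up being re-read by many threads.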

I assume that this implementation must be somehow optimized, and I am guessing that it has to do with thread synchronization and memory access - which is why I've bundled it with Question 1. Any ideas?

Upvotes: 0

Views: 2250

Answers (2)

scherv

Reputation: 19

It's worth adding that when you know at compile time how long the array needs to be, you can allocate it inside the kernel like this:

threadgroup float example2[50];

And when you don't know the length ahead of time, so that only the host can decide how much threadgroup memory is needed, you instead take a threadgroup buffer as a kernel argument:

kernel void my_func(...,
                    threadgroup float *example2 [[threadgroup(0)]],
                    ...)

and set its size from the host with setThreadgroupMemoryLength:atIndex:.

Upvotes: 0

Ken Thomases

Reputation: 90671

I don't see a relationship between your two questions. Regarding question 1: yes, Metal provides for shared threadgroup memory in compute functions. Just specify the threadgroup address space qualifier on the variable declaration. For example:

threadgroup float example1;

You can also specify threadgroup buffers as input arguments to a compute function.

kernel void my_func(...,
                    threadgroup float *example2 [[threadgroup(0)]],
                    ...)
{
    ...
}

The buffer is allocated by the device. The size of the buffer is set using the -setThreadgroupMemoryLength:atIndex: method of the compute command encoder.
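
As a rough sketch of how such a threadgroup buffer is typically used (the kernel and buffer names here are made up for illustration, and the data length is assumed to be a multiple of the threadgroup size), each thread stages one element from device memory into the threadgroup array, the whole threadgroup synchronizes with threadgroup_barrier(), and only then do threads read elements written by their neighbors:

kernel void reverse_within_threadgroup(const device float* inData  [[ buffer(0) ]],
                                       device float*       outData [[ buffer(1) ]],
                                       threadgroup float*  tile    [[ threadgroup(0) ]],
                                       ushort              tid     [[ thread_index_in_threadgroup ]],
                                       ushort              tgSize  [[ threads_per_threadgroup ]],
                                       uint                gid     [[ thread_position_in_grid ]])
{
    // Each thread stages one element from device memory into threadgroup memory.
    tile[tid] = inData[gid];

    // Wait until every thread in the threadgroup has written its element.
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Now it's safe to read elements written by other threads in the same threadgroup.
    outData[gid] = tile[tgSize - 1 - tid];
}

For a kernel like this, the host would call -setThreadgroupMemoryLength:atIndex: with a length of threadsPerThreadgroup.width * sizeof(float) at index 0.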

Upvotes: 4
