Reputation: 397
I'm learning Apple's Metal to do some GPU computation.
I looked at the matrix multiplication example provided by Apple, and there's one point I can't understand.
In the file MetalMatrixMult.h:
// Number of rows in matrices A and C.
@property (nonatomic) uint16_t m;
// Number of columns in matrix A; number of rows in matrix B.
@property (nonatomic) uint16_t n;
// Number of columns in matrices B and C.
@property (nonatomic) uint16_t k;
// Output matrix (padded) C row count
@property (nonatomic, readonly) uint16_t M;
// Output matrix (padded) C column count
@property (nonatomic, readonly) uint16_t K;
// Output matrix C = A x B
@property (nonatomic, readonly) float* output;
It says the matrix C is padded. I'm not clear on what "pad" means here. Is it some kind of alignment? I know there are type alignment rules in Metal's shading language specification, but I don't know why we need to pad a buffer here.
Thanks.
Upvotes: 5
Views: 1285
Reputation: 3577
It has to do with optimizing memory access. Your GPU has a number of threadgroups, each containing a relatively small amount of dedicated memory (a few KB) that can be accessed very quickly. This is separate from your GPU's main memory, which might be a few GBs of comparatively slow memory.
Since it's unlikely that all 3 matrices (A, B and C) can fit into a single threadgroup's memory, and falling back to main memory inside loops would be extremely slow, we divide the computation into "blocks" or sectors. Imagine dividing the result matrix C into a grid, where each sector is a collection of 8 x 8 elements: we can then instruct Threadgroup 1 to compute the result for the top-left sector while other threadgroups compute the other sectors simultaneously. In this case, Threadgroup 1 only needs the first 8 rows of A and the first 8 columns of B to compute its portion of C. This means we can send a much smaller amount of data to Threadgroup 1, keeping it well within the threadgroup memory limit.
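To make the sector idea concrete, here is a minimal CPU-side sketch in plain C of what one threadgroup's share of the work amounts to. It is not Apple's kernel: the function name `compute_sector`, the `SECTOR` constant, and the row-major layout are all assumptions made for illustration, and it assumes the row and column counts of C are already multiples of the sector size.

```c
#include <stddef.h>

/* Hypothetical sector edge length; 8 matches the 8 x 8 example in the text. */
#define SECTOR 8

/* Compute one SECTOR x SECTOR block of C = A x B.
   A is M x n, B is n x K, C is M x K, all row-major (an assumption).
   M and K are assumed to be multiples of SECTOR.
   Only rows [block_row*SECTOR, +SECTOR) of A and
   columns [block_col*SECTOR, +SECTOR) of B are read. */
void compute_sector(const float *A, const float *B, float *C,
                    size_t n, size_t K,
                    size_t block_row, size_t block_col) {
    for (size_t i = 0; i < SECTOR; ++i) {
        size_t row = block_row * SECTOR + i;
        for (size_t j = 0; j < SECTOR; ++j) {
            size_t col = block_col * SECTOR + j;
            float sum = 0.0f;
            for (size_t p = 0; p < n; ++p)
                sum += A[row * n + p] * B[p * K + col];
            C[row * K + col] = sum;
        }
    }
}
```

Calling `compute_sector(A, B, C, n, K, 0, 0)` fills only the top-left sector. On the GPU, each threadgroup would handle a different `(block_row, block_col)` pair in parallel.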
The reason Metal requires us to pad the matrices is so that it can divide C into a perfect grid. If your true result matrix is 12 x 18, and the sector size is 8 x 8, that means C is 1.5 x 2.25 sectors. The GPU can't efficiently operate on partial sectors, so you must pad your matrices with zeros to reach whole numbers - in this case 2 x 3 sectors or 16 x 24 elements. You sacrifice a little bit of storage and a few more clock cycles for highly optimized parallel processing.
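The padding arithmetic above can be sketched in a few lines of C. The helper names `padded_dim` and `pad_matrix` are hypothetical (they are not taken from Apple's sample), and the sector size of 8 and the row-major layout are assumptions carried over from the example.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sector edge length; 8 matches the example in the text. */
#define SECTOR 8

/* Round n up to the next multiple of SECTOR, e.g. 12 -> 16, 18 -> 24. */
size_t padded_dim(size_t n) {
    return (n + SECTOR - 1) / SECTOR * SECTOR;
}

/* Copy a rows x cols row-major matrix into a newly allocated, zero-filled
   buffer of padded_dim(rows) x padded_dim(cols) elements.
   Returns NULL on allocation failure; the caller frees the result. */
float *pad_matrix(const float *src, size_t rows, size_t cols) {
    size_t prows = padded_dim(rows);
    size_t pcols = padded_dim(cols);
    float *dst = calloc(prows * pcols, sizeof *dst);
    if (dst == NULL)
        return NULL;
    for (size_t r = 0; r < rows; ++r)
        memcpy(dst + r * pcols, src + r * cols, cols * sizeof *dst);
    return dst;
}
```

Because the extra rows and columns are zeros, they contribute nothing to the products, so the top-left 12 x 18 region of the padded result is the true C.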
Upvotes: 5