Crt Tax
Crt Tax

Reputation: 397

Why Apple's Metal Matrix Multiplication Example needs padding C

I'm learning Apple's Metal trying to do some GPU computation.

I checked the matrix multiplication example given by Apple. There's a point I cannot understand.

In the file MetalMatrixMult.h

// Number of rows in matrices A and C.
@property (nonatomic) uint16_t m;

// Number of columns in matrix A; number of rows in matrix B.
@property (nonatomic) uint16_t n;

// Number of columns in matrices B and C.
@property (nonatomic) uint16_t k;

// Output matrix (padded) C row count
@property (nonatomic, readonly) uint16_t M;

// Output matrix (padded) C column count
@property (nonatomic, readonly) uint16_t K;

// Output matrix C = A x B
@property (nonatomic, readonly) float* output;

It says the Matrix C is padded. I'm not clear what pad means here. Is it some kind of alignment? Cause I know there are types alignment in Metal's shader language specification, but I don't know why we need to pad a buffer herer.

Thanks.

Upvotes: 5

Views: 1285

Answers (1)

Hundley
Hundley

Reputation: 3577

It has to do with optimizing memory access. Your GPU has a number of threadgroups, each containing a relatively small amount of dedicated memory (a few KB) that can be accessed very quickly. This is separate from your GPU's main memory, which might be a few GBs of comparatively slow memory.

Since it's unlikely that all 3 matrices (A, B and C) can fit into a single threadgroup's memory, and falling back to main memory inside loops would be extremely slow, we divide the computation into "blocks" or sectors. Imagine dividing the result matrix C into a grid, where each sector is a collection of 8 x 8 elements: we can then instruct Threadgroup 1 to compute the result for the top-left sector while other threadgroups compute the other sectors simulataneously. In this case, Threadgroup 1 only needs the first 8 rows of A and the first 8 columns of B to compute its portion of C. This means we can send a much smaller amount of data to Threadgroup 1, keeping it well within the cache limit.

The reason Metal requires us to pad the matrices is so that it can divide C into a perfect grid. If your true result matrix is 12 x 18, and the sector size is 8 x 8, that means C is 1.5 x 2.25 sectors. The GPU can't efficiently operate on partial sectors, so you must pad your matrices with zeros to reach whole numbers - in this case 2 x 3 sectors or 16 x 24 elements. You sacrifice a little bit of storage and a few more clock cycles for highly optimized parallel processing.

Upvotes: 5

Related Questions