Reputation: 11
This is a basic question asking to calculate the "peak achievable performance" (in MFLOPS) of vector-vector, vector-matrix, and matrix-matrix multiplication, given certain information about the memory system. The following is the set-up of the question:
Part (a): For arbitrary vectors a and b, what is the peak performance, given the following code:
/* dot product loop */
for (i = 0; i < dim; i++)
    dot_prod += a[i] * b[i];
Part (b): What is the peak performance of the matrix-vector multiplication given by the following code:
/* matrix-vector product loop */
for (i = 0; i < dim; i++)
    for (j = 0; j < dim; j++)
        c[i] += a[i][j] * b[j];
Part (c): What is the performance of the matrix-matrix multiplication given by the following code:
/* matrix-matrix product loop */
for (i = 0; i < dim; i++)
    for (j = 0; j < dim; j++)
        for (k = 0; k < dim; k++)
            c[i][j] += a[i][k] * b[k][j];
I actually know the correct solutions to each part (40, 80, and 16 MFLOPS respectively), but I cannot fully understand the reasoning behind said solutions. For some context, I don't have a CS/EE background so my understanding of cache and memory systems is very limited. This is my first time trying to understand some of these concepts in a more academic manner.
For part (a), I understand that you can fetch 8 words with 2 cache-line fetches. In other words, 8 floating-point operations will be performed for every 2 cache-line fetches. This must be where the 40 MFLOPS comes from, since 8 operations every 200 ns works out to 40 MFLOPS.
Part (b) is where I start getting confused. According to the solution, if the vector b is cached, 8 operations can be performed per cache-line fetch, which results in 80 MFLOPS. At this point, I'm confused as to why the fetching of vector b into the cache does not seem to be taken into account. The problem statement never describes vector b as already being in the cache. Doesn't that mean there will be a latency associated with retrieving that vector and initially placing it into the cache?
Part (c) actually makes more sense to me since 8 floating-point operations are carried out over 5 cache lines (1 for matrix A and 4 for matrix B since you have to access elements column-wise).
The main confusion I'm having is with part (b). I'm also having a hard time finding similar examples. Everything I've been finding has either been way over my head, or not detailed enough. If anyone can give a simple and straightforward explanation (especially for part (b)), that would really help with my understanding of these fundamental concepts.
Many thanks in advance! Just trying to learn here!
Upvotes: 1
Views: 2607
Reputation: 11547
This problem is memory bound, so we can solve it by considering only cache and memory accesses.
a/ The first iteration causes a cache miss; the next 3 iterations' accesses to a[i] and b[i] hit in the cache. Hence, every 4 iterations there are 2 cache misses, one for a[i] and one for b[i]. So 4 iterations last 2*100 = 200 ns (the memory latency), and in those 4 iterations we perform 4*2 ops (* and +=).
So 8 ops every 200 ns is 8/(200*10⁻⁹) ops/s = 40 MFLOPS.
b/ After the warm-up phase, vector b remains in the cache and no longer causes cache misses. Similarly, misses for c happen only once every 4N operations and can be neglected. So every 4 iterations there is 1 cache miss (for a[i][j]); 4 iterations last 100 ns and perform 4*2 ops.
So 8 ops every 100 ns is 80 MFLOPS.
c/ The matrix multiplication is performed in the wrong loop order: every access to b[k][j] touches a different cache line. Indeed, what you say in the problem description is wrong. If the matrices are 4k*4k, they hold 16M elements each and require at least 4*16 MB = 64 MB (for floats; 128 MB for doubles), not 16 kB. Accordingly, the matrices do NOT fit in the 32 kB cache, and every access to b[k][j] causes a cache miss. Every 4 iterations, a[i][k] causes one more miss, and misses for c[i][j] can be neglected. So 4 iterations require 5 memory accesses and last 5*100 ns, and in those 4 iterations there are 4*2 ops (* and +=).
So 8 ops in 500 ns is 16 MFLOPS.
Upvotes: 2