Reputation: 11
This is a basic question asking to calculate the "peak achievable performance" (in MFLOPS) of vector-vector, vector-matrix, and matrix-matrix multiplication, given certain information about the memory system. The following is the set-up of the question:
Part (a): For arbitrary vectors a and b, what is the peak performance, given the following code:
/* dot product loop */
for (i = 0; i < dim; i++)
    dot_prod += a[i] * b[i];
Part (b): What is the peak performance of the matrix-vector multiplication given by the following code:
/* matrix-vector product loop */
for (i = 0; i < dim; i++)
    for (j = 0; j < dim; j++)
        c[i] += a[i][j] * b[j];
Part (c): What is the performance of the matrix-matrix multiplication given by the following code:
/* matrix-matrix product loop */
for (i = 0; i < dim; i++)
    for (j = 0; j < dim; j++)
        for (k = 0; k < dim; k++)
            c[i][j] += a[i][k] * b[k][j];
I actually know the correct solutions to each part (40, 80, and 16 MFLOPS respectively), but I cannot fully understand the reasoning behind said solutions. For some context, I don't have a CS/EE background so my understanding of cache and memory systems is very limited. This is my first time trying to understand some of these concepts in a more academic manner.
For part (a), I understand that you can fetch 8 words with 2 cache-line fetches. In other words, 8 floating-point operations will be performed for every 2 cache-line fetches. This must be where the 40 MFLOPS comes from, since 8 operations every 200 ns works out to 40 MFLOPS.
Part (b) is where I start getting confused. According to the solution, if the vector b is cached, 8 operations can be performed per cache-line fetch, which results in 80 MFLOPS. At this point, I'm confused as to why the fetching of vector b into the cache does not seem to be taken into account. The problem statement never describes vector b as already being in the cache. Doesn't that mean there will be a latency associated with retrieving that vector and initially placing it into the cache?
Part (c) actually makes more sense to me since 8 floating-point operations are carried out over 5 cache lines (1 for matrix A and 4 for matrix B since you have to access elements column-wise).
The main confusion I'm having is with part (b). I'm also having a hard time finding similar examples. Everything I've been finding has either been way over my head, or not detailed enough. If anyone can give a simple and straightforward explanation (especially for part (b)), that would really help with my understanding of these fundamental concepts.
Many thanks in advance! Just trying to learn here!
Upvotes: 1
Views: 2607
Reputation: 11547
This problem is memory bound, so we can solve it by considering only cache and memory accesses.
a/ The first iteration causes a cache miss; the next 3 iterations' accesses to a[i] and b[i] hit in the cache. Hence, every 4 iterations there are 2 cache misses, one for a[i] and one for b[i]. So 4 iterations last 2*100 = 200 ns (the memory latency), and in those 4 iterations we perform 4*2 ops (* and +=).
So 8 ops every 200 ns is 8/(200*10⁻⁹) ops/s = 40 MFLOPS.
b/ After the warm-up phase, vector b remains in the cache and no longer causes cache misses. Similarly, misses for c happen only once every 4N operations and can be neglected. So every 4 iterations there is 1 cache miss (for a[i][j]); 4 iterations last 100 ns and perform 4*2 ops.
So 8 ops every 100 ns is 80 MFLOPS.
c/ The matrix multiplication is performed in the wrong loop order: every access to b[k][j] touches a different cache line. Indeed, what you say in the problem description is wrong. If the matrices are 4k*4k, they hold 16M elements each and require at least 4*16 MB = 64 MB (for floats; 128 MB for doubles), not 16 kB. Accordingly, the matrices do NOT fit in the 32 kB cache, and every access to b[k][j] causes a cache miss. Every 4 iterations, a[i][k] causes one more miss, and misses for c[i][j] can be neglected. So 4 iterations require 5 memory accesses and last 5*100 ns, and in those 4 iterations there are 4*2 ops (* and +=).
So 8 ops in 500 ns is 16 MFLOPS.
Upvotes: 2