Reputation: 24815
I have written a basic three-loop matrix multiplication with serial and OpenMP implementations. For the same size (3200x3200), perf stat -a -e instructions,cycles shows:
Serial
265,755,992,060 instructions # 0.71 insn per cycle
375,319,584,656 cycles
85.923380841 seconds time elapsed
Parallel (16 threads)
264,524,937,733 instructions # 0.30 insn per cycle
883,342,095,910 cycles
13.381343295 seconds time elapsed
In the parallel run, I expected the number of cycles to be roughly the same as in the serial run, but it isn't. Any thoughts on the difference?
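For reference, a minimal sketch of the kind of naive three-loop OpenMP matmul the question describes (the actual code isn't shown, so the array names, loop order, and timing scaffolding here are assumptions):

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 3200

/* Naive three-loop matrix multiplication: C = A * B.
 * The outer loop is parallelized with OpenMP; the i/j/k order
 * is only an assumption about the code in the question. */
static void matmul(const double *A, const double *B, double *C)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i * (long)N + k] * B[k * (long)N + j];
            C[i * (long)N + j] = sum;
        }
}

int main(void)
{
    double *A = malloc(sizeof(double) * N * N);
    double *B = malloc(sizeof(double) * N * N);
    double *C = malloc(sizeof(double) * N * N);
    for (long i = 0; i < (long)N * N; i++) { A[i] = 1.0; B[i] = 2.0; }

    double t0 = omp_get_wtime();
    matmul(A, B, C);
    double t1 = omp_get_wtime();
    /* Wall-clock time of the multiplication only. */
    printf("MM execution time: %.6f s\n", t1 - t0);

    free(A); free(B); free(C);
    return 0;
}
```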
UPDATE:
I reran the experiments with 8 and 16 threads, since the processor supports a maximum of 16 threads.
Using 8 threads
Max nthread = 16
Total execution Time in seconds: 13.4407235400
MM execution Time in seconds: 13.3349801241
Performance counter stats for 'system wide':
906.51 Joules power/energy-pkg/
264,995,810,457 instructions # 0.59 insn per cycle
449,772,039,792 cycles
13.469242993 seconds time elapsed
and
Using 16 threads
Max nthread = 16
Total execution Time in seconds: 13.2618084711
MM execution Time in seconds: 13.1565077840
Performance counter stats for 'system wide':
1,000.39 Joules power/energy-pkg/
264,309,881,365 instructions # 0.30 insn per cycle
882,881,366,456 cycles
13.289234564 seconds time elapsed
As you see, the wall clock times are roughly the same, but the cycle count for 16 threads is about twice that for 8 threads. That means that with a higher cycle count and lower IPC, it is possible to keep the wall clock time the same with more threads. According to perf list, the event is cpu-cycles OR cycles [Hardware event]. I would like to know whether that is the average cycle count for one core or the aggregate over N cores. Any comment on that?
Upvotes: 2
Views: 2376
Reputation: 5852
Matrix-matrix multiplication is the typical example of an operation that theoretically has lots of cache reuse and so can run close to peak speed, except that the standard three-loop implementation doesn't. Optimized implementations that use the cache efficiently are actually six loops deep: each of your three loops gets tiled, and you need to interchange loops and set the tile sizes just right. This is not trivial.
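To make the six-loop structure concrete, here is a rough sketch of a tiled version. The tile size BS is a placeholder that has to be tuned to the cache sizes, C must be zeroed beforehand, and real implementations add packing and a vectorized micro-kernel on top of this:

```c
/* Tiled (cache-blocked) matmul sketch: each of the three loops is
 * split into a tile loop and an in-tile loop, giving six loops.
 * BS is a placeholder tile size to be tuned per cache level. */
#define BS 64

void matmul_tiled(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                /* Multiply one pair of BS x BS tiles; j innermost so
                 * B and C are accessed contiguously within the tile. */
                for (int i = ii; i < ii + BS && i < n; i++)
                    for (int k = kk; k < kk + BS && k < n; k++) {
                        double a = A[i * (long)n + k];
                        for (int j = jj; j < jj + BS && j < n; j++)
                            C[i * (long)n + j] += a * B[k * (long)n + j];
                    }
}
```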
So your implementation basically uses no cache. You may get some effect from the L3 cache, but, given a sufficient problem size, certainly not from L1, and likely not from L2. In other words: you are bandwidth-bound. And then the problem is that you may have 16 cores, but you don't have enough bandwidth to feed all of them.
I must say that your factor of 6-7 is a little disappointing, but maybe your architecture just doesn't have enough bandwidth. For top-of-the-line nodes I would expect something like 12. But why don't you test it? Write a benchmark that does nothing but stream data to and from memory: vector-vector addition. Then see up to how many cores you get linear speedup.
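A minimal sketch of such a bandwidth test, assuming a STREAM-style vector add with arrays far larger than the last-level cache (the array size and repeat count are placeholders):

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* STREAM-style vector add: c[i] = a[i] + b[i].
 * N must be much larger than the last-level cache so the loop
 * measures memory bandwidth, not cache bandwidth. */
#define N    (1L << 26)   /* 64M doubles per array, ~512 MiB each */
#define REPS 10

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    double t0 = omp_get_wtime();
    for (int r = 0; r < REPS; r++) {
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }
    double t1 = omp_get_wtime();

    /* Count 3 arrays per pass: read a, read b, write c
     * (write-allocate traffic is ignored for simplicity). */
    double gbytes = 3.0 * N * sizeof(double) * REPS / 1e9;
    printf("threads: %d, bandwidth: %.2f GB/s\n",
           omp_get_max_threads(), gbytes / (t1 - t0));

    free(a); free(b); free(c);
    return 0;
}
```

Run it with OMP_NUM_THREADS=1,2,4,... and note where the measured bandwidth stops growing; that is roughly the core count beyond which a bandwidth-bound matmul stops scaling.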
To address some points you raise in your reply:
Speedup is nice, but you should also look at absolute performance. Matmul can run at 90+ percent of peak. Use MKL or any optimized BLAS and compare that to what you get (see the dgemm sketch after these points).
SIMD makes no difference in speedup, only in absolute performance.
In that i,j,k update you're not stating which index is in the inner loop. Regardless, let your compiler generate an optimization report. You'll find that compilers are that clever, and may very well have interchanged loops to get vectorization.
FP latency is not a concern in kernels as simple as yours. Between prefetching and out-of-order execution and whatnot you really don't have to worry much about latency.
Really: your performance is bandwidth-limited. But you're only measuring speedup so you don't really see that, apart from the fact that using all your cores will saturate your bandwidth and cap your speedup.
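For the BLAS comparison mentioned above, a minimal sketch using the standard CBLAS interface (link against OpenBLAS, MKL, or another optimized BLAS; the timing scaffolding is an assumption):

```c
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>   /* provided by OpenBLAS, MKL, etc. */
#include <omp.h>

int main(void)
{
    const int n = 3200;
    double *A = malloc((size_t)n * n * sizeof(double));
    double *B = malloc((size_t)n * n * sizeof(double));
    double *C = malloc((size_t)n * n * sizeof(double));
    for (long i = 0; i < (long)n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    double t0 = omp_get_wtime();
    /* C = 1.0 * A * B + 0.0 * C, row-major, no transposes. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    double t1 = omp_get_wtime();

    /* Dense matmul does 2*n^3 floating-point operations. */
    double gflops = 2.0 * n * (double)n * n / (t1 - t0) / 1e9;
    printf("dgemm: %.3f s, %.1f GFLOP/s\n", t1 - t0, gflops);

    free(A); free(B); free(C);
    return 0;
}
```

Compare the reported GFLOP/s with the machine's theoretical peak (cores × SIMD width × FMA throughput × clock) to see how far the naive version really is from it.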
Upvotes: 1
Reputation: 50956
Assuming your process scales perfectly, the instructions will be shared amongst all cores and the total number of instructions will be the same. The same applies to the number of cycles. However, when your process does not scale, the number of instructions should stay the same but the number of cycles increases. This is generally due to contention on a shared resource that causes stall cycles in the pipeline.
In your parallel implementation, the memory hierarchy is not used efficiently: the naive loop nest causes a lot of cache misses that can saturate the L3 or the RAM when many threads are used, hence memory stalls. Simultaneous multithreading (aka Hyper-Threading) can also cause this problem, because two threads on the same core often do not run truly in parallel (some parts of the core are shared between the threads).
Upvotes: 3
Reputation: 366036
A constant instruction count makes sense; there's the same amount of total work to do whether those instructions all run on the same core or not.
Are you using SMT such as Hyper-Threading? IPC goes down when the same physical core divides its time between two logical cores. For some programs, SMT increases overall throughput; for cache-bound programs (like a naive matmul), having 2 threads compete for the same L1/L2 cache hurts.
Otherwise, if these are 16 physical cores, you might be seeing the same effect from L3 contention, since a single thread no longer has all of the L3 to itself.
Anyway, with proper cache-blocking, SMT usually hurts overall throughput for matmul / dense linear algebra. A physical core can be saturated with ALU work by one thread with well-tuned code, so contention for per-core caches just hurts. (But multi-threading definitely helps overall time, like it did in your case.) A cache-blocked matmul is usually 5 nested loops, as in the example near the end of What Every Programmer Should Know About Memory?
Agner Fog (https://agner.org/optimize/) also mentions hyperthreading hurting for some workloads in his microarch PDF.
Upvotes: 1