Reputation: 24815
I have written a basic three-loop matrix multiplication with serial and OpenMP implementations. For the same size (3200x3200), perf stat -a -e instructions,cycles shows:
Serial
265,755,992,060 instructions # 0.71 insn per cycle
375,319,584,656 cycles
85.923380841 seconds time elapsed
Parallel (16 threads)
264,524,937,733 instructions # 0.30 insn per cycle
883,342,095,910 cycles
13.381343295 seconds time elapsed
In the parallel run, I expected the number of cycles to be roughly the same as in the serial run, but it isn't. Any thoughts on the difference?
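For reference, a minimal sketch of the kind of naive three-loop OpenMP matmul the question describes (the actual code isn't shown, so the array names, loop order, and timing scaffolding here are assumptions):

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 3200

/* Naive three-loop matrix multiplication: C = A * B.
 * The outer loop is parallelized with OpenMP; the i/j/k order
 * is only an assumption about the code in the question. */
static void matmul(const double *A, const double *B, double *C)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i * (long)N + k] * B[k * (long)N + j];
            C[i * (long)N + j] = sum;
        }
}

int main(void)
{
    double *A = malloc(sizeof(double) * N * N);
    double *B = malloc(sizeof(double) * N * N);
    double *C = malloc(sizeof(double) * N * N);
    for (long i = 0; i < (long)N * N; i++) { A[i] = 1.0; B[i] = 2.0; }

    double t0 = omp_get_wtime();
    matmul(A, B, C);
    double t1 = omp_get_wtime();
    /* Wall-clock time of the multiplication only. */
    printf("MM execution time: %.6f s\n", t1 - t0);

    free(A); free(B); free(C);
    return 0;
}
```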
UPDATE:
I reran the experiments with 8 and 16 threads, since the processor supports a maximum of 16 threads.
Using 8 threads
Max nthread = 16
Total execution Time in seconds: 13.4407235400
MM execution Time in seconds: 13.3349801241
Performance counter stats for 'system wide':
906.51 Joules power/energy-pkg/
264,995,810,457 instructions # 0.59 insn per cycle
449,772,039,792 cycles
13.469242993 seconds time elapsed
and
Using 16 threads
Max nthread = 16
Total execution Time in seconds: 13.2618084711
MM execution Time in seconds: 13.1565077840
Performance counter stats for 'system wide':
1,000.39 Joules power/energy-pkg/
264,309,881,365 instructions # 0.30 insn per cycle
882,881,366,456 cycles
13.289234564 seconds time elapsed
As you see, the wall clock times are roughly the same, but the cycle count for 16 threads is about twice that for 8 threads. That means that with a higher cycle count and lower IPC, it is possible to keep the wall clock time the same with more threads. According to perf list, the event is cpu-cycles OR cycles [Hardware event]. I would like to know whether that is the average cycle count for one core or the aggregate over N cores. Any comment on that?
Upvotes: 2
Views: 2376
Reputation: 5852
Matrix-matrix multiplication is the typical example of an operation that theoretically has lots of cache reuse and so can run close to peak speed, except that the standard three-loop implementation doesn't. Optimized implementations that use the cache efficiently are actually six loops deep: each of your three loops gets tiled, and you need to interchange loops and set the tile sizes just right. This is not trivial.
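To make the six-loop structure concrete, here is a rough sketch of a tiled version. The tile size BS is a placeholder that has to be tuned to the cache sizes, C must be zeroed beforehand, and real implementations add packing and a vectorized micro-kernel on top of this:

```c
/* Tiled (cache-blocked) matmul sketch: each of the three loops is
 * split into a tile loop and an in-tile loop, giving six loops.
 * BS is a placeholder tile size to be tuned per cache level. */
#define BS 64

void matmul_tiled(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                /* Multiply one pair of BS x BS tiles; j innermost so
                 * B and C are accessed contiguously within the tile. */
                for (int i = ii; i < ii + BS && i < n; i++)
                    for (int k = kk; k < kk + BS && k < n; k++) {
                        double a = A[i * (long)n + k];
                        for (int j = jj; j < jj + BS && j < n; j++)
                            C[i * (long)n + j] += a * B[k * (long)n + j];
                    }
}
```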
So your implementation basically uses no cache. You may get some effect from the L3 cache, but, given a sufficient problem size, certainly not from L1, and likely not from L2. In other words: you are bandwidth-bound. And then the problem is that you may have 16 cores, but you don't have enough bandwidth to feed all of them.
I must say that your factor of 6-7 is a little disappointing, but maybe your architecture just doesn't have enough bandwidth. For top-of-the-line nodes I would expect something like 12. But why don't you test it? Write a benchmark that does nothing but stream data to and from memory: vector-vector addition. Then see up to how many cores you get linear speedup.
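A minimal sketch of such a bandwidth test, assuming a STREAM-style vector add with arrays far larger than the last-level cache (the array size and repeat count are placeholders):

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* STREAM-style vector add: c[i] = a[i] + b[i].
 * N must be much larger than the last-level cache so the loop
 * measures memory bandwidth, not cache bandwidth. */
#define N    (1L << 26)   /* 64M doubles per array, ~512 MiB each */
#define REPS 10

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    double t0 = omp_get_wtime();
    for (int r = 0; r < REPS; r++) {
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }
    double t1 = omp_get_wtime();

    /* Count 3 arrays per pass: read a, read b, write c
     * (write-allocate traffic is ignored for simplicity). */
    double gbytes = 3.0 * N * sizeof(double) * REPS / 1e9;
    printf("threads: %d, bandwidth: %.2f GB/s\n",
           omp_get_max_threads(), gbytes / (t1 - t0));

    free(a); free(b); free(c);
    return 0;
}
```

Run it with OMP_NUM_THREADS=1,2,4,... and note where the measured bandwidth stops growing; that is roughly the core count beyond which a bandwidth-bound matmul stops scaling.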
To address some points you raise in your reply:
Speedup is nice, but you should also look at absolute performance. Matmul can run at 90+ percent of peak. Use MKL or any optimized BLAS and compare that to what you get (see the dgemm sketch after these points).
SIMD makes no difference in speedup, only in absolute performance.
In that i,j,k update you're not stating which index is in the inner loop. Regardless, let your compiler generate an optimization report. You'll find that compilers are that clever, and may very well have interchanged loops to get vectorization.
FP latency is not a concern in kernels as simple as yours. Between prefetching and out-of-order execution and whatnot you really don't have to worry much about latency.
Really: your performance is bandwidth-limited. But you're only measuring speedup so you don't really see that, apart from the fact that using all your cores will saturate your bandwidth and cap your speedup.
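For the BLAS comparison mentioned above, a minimal sketch using the standard CBLAS interface (link against OpenBLAS, MKL, or another optimized BLAS; the timing scaffolding is an assumption):

```c
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>   /* provided by OpenBLAS, MKL, etc. */
#include <omp.h>

int main(void)
{
    const int n = 3200;
    double *A = malloc((size_t)n * n * sizeof(double));
    double *B = malloc((size_t)n * n * sizeof(double));
    double *C = malloc((size_t)n * n * sizeof(double));
    for (long i = 0; i < (long)n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    double t0 = omp_get_wtime();
    /* C = 1.0 * A * B + 0.0 * C, row-major, no transposes. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    double t1 = omp_get_wtime();

    /* Dense matmul does 2*n^3 floating-point operations. */
    double gflops = 2.0 * n * (double)n * n / (t1 - t0) / 1e9;
    printf("dgemm: %.3f s, %.1f GFLOP/s\n", t1 - t0, gflops);

    free(A); free(B); free(C);
    return 0;
}
```

Compare the reported GFLOP/s with the machine's theoretical peak (cores × SIMD width × FMA throughput × clock) to see how far the naive version really is from it.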
Upvotes: 1
Reputation: 50956
Assuming your process scales perfectly, the instructions will be shared amongst all cores and the total number of instructions will be the same. The same applies to the number of cycles. However, when your process does not scale, the number of instructions should stay the same but the number of cycles increases. This is generally due to contention on a shared resource that causes stall cycles in the pipeline.
In your parallel implementation, the memory hierarchy is not used efficiently: the naive loop nest causes a lot of cache misses that can saturate the L3 or the RAM when many threads are used, hence memory stalls. Simultaneous multithreading (aka Hyper-Threading) can also cause this problem, because two threads on the same core often do not run truly in parallel (some parts of the core are shared between the threads).
Upvotes: 3
Reputation: 366036
A constant instruction count makes sense; there's the same amount of total work to do whether those instructions all run on the same core or not.
Are you using SMT such as Hyper-Threading? IPC goes down when the same physical core divides its time between two logical cores. For some programs, SMT increases overall throughput; for cache-bound programs (like a naive matmul), having 2 threads compete for the same L1/L2 cache hurts.
Otherwise, if these are 16 physical cores, you might be seeing the same effect from L3 contention, since a single thread no longer has all of the L3 to itself.
Anyway, with proper cache-blocking, SMT usually hurts overall throughput for matmul / dense linear algebra. A physical core can be saturated with ALU work by one thread with well-tuned code, so contention for per-core caches just hurts. (But multi-threading definitely helps overall time, like it did in your case.) A cache-blocked matmul is usually 5 nested loops, as in the example near the end of What Every Programmer Should Know About Memory?
Agner Fog (https://agner.org/optimize/) also mentions hyperthreading hurting for some workloads in his microarch PDF.
Upvotes: 1