Parallelization vs vectorization performance bottleneck: Do AVX and MT compete?

Question

I tried to compute the sum of all elements in a large matrix. Here are the test cases:

1. MT and AVX: 37 s
2. MT and no AVX: 40 s
3. AVX and no MT: 49 s
4. Neither AVX nor MT: 105 s

In all cases, the CPU clock is fixed at 3.0 GHz (as reported by cpufreq-info):

current policy: frequency should be within 1.60 GHz and 3.40 GHz.
                The governor "userspace" may decide which speed to use
                 within this range.
current CPU frequency is 3.00 GHz.

The matrix has 25000000 elements of type double, each with the value 1.0, and the sum is computed 4096 times in a loop. When running MT, the matrix is divided into 4 blocks, one per thread.

Without AVX, the speedup from using MT is 2.6x. With AVX it is only 1.3x. If I reduce the CPU frequency, the MT improvement gets larger for AVX, so there might also be some issue with cache misses, but that cannot explain the difference between (4)/(2) and (3)/(1).

Do AVX and MT compete with each other in some way? The chip is an i5-3570K.

Leeor · Accepted Answer

It's quite possible that your baseline performance was bounded by execution latency, and that either form of parallelization (MT or vectorization) allowed you to break through that and reach the next bottleneck, which is the memory bandwidth of your CPU.
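
To illustrate what "bounded by execution latency" means here, the sketch below contrasts a sum with a single accumulator, where every add has to wait for the previous one, with a version that keeps four independent accumulators. It assumes the baseline was a plain single-accumulator loop, which the question does not show, and the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Single accumulator: each add depends on the result of the previous add,
// so the loop cannot run faster than one add per floating-point add latency.
double sum_single_chain(const std::vector<double>& v) {
    double s = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        s += v[i];
    }
    return s;
}

// Four independent accumulators: adds that belong to different chains do not
// depend on each other, so the CPU can overlap them.
double sum_four_chains(const std::vector<double>& v) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    const std::size_t n4 = v.size() - v.size() % 4;  // largest multiple of 4
    std::size_t i = 0;
    for (; i < n4; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    assert(i == n4);
    double s = (s0 + s1) + (s2 + s3);
    for (; i < v.size(); ++i) {  // leftover elements when size is not a multiple of 4
        s += v[i];
    }
    return s;
}
```

Vectorizing with AVX or splitting the matrix across threads has a similar effect of creating several independent chains of adds, which is one way to read why either of them alone gives a large gain. Note that a compiler will normally not do this reordering by itself for double, since floating-point addition is not associative.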

Check the peak bandwidth your CPU can reach and compare it with your data. It looks like you're simply saturating at about 20.5 GB/s (25000000 elements * 4096 loops * 8 bytes, assuming that's what your system uses for double, / ~40 seconds). That seems a little low, as this link says it should reach 25 GB/s, but it's in the same ballpark, so the gap could be due to other inefficiencies: the type of DDR, other apps or the OS working in the background, frequency scaling done by the CPU to save power or reduce heat, etc.
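
The arithmetic above can be checked with a few lines of C++. This is only a sketch: the function names are hypothetical, 8 bytes per double is an assumption (true on typical x86-64 systems), and the inputs are the element count, loop count, timings and clock from the question.

```cpp
#include <cassert>
#include <cstdint>

// Effective bandwidth in GB/s (1 GB = 1e9 bytes), assuming every pass
// streams the whole matrix from memory exactly once.
double effective_bandwidth_gbs(std::uint64_t elements, std::uint64_t passes,
                               double seconds) {
    assert(seconds > 0.0);
    const double bytes = static_cast<double>(elements) *
                         static_cast<double>(passes) *
                         static_cast<double>(sizeof(double));  // assumed 8 bytes
    return bytes / seconds / 1e9;
}

// Average number of clock cycles spent per summed element.
double cycles_per_element(std::uint64_t elements, std::uint64_t passes,
                          double seconds, double clock_hz) {
    assert(elements > 0 && passes > 0);
    const double adds = static_cast<double>(elements) *
                        static_cast<double>(passes);
    return seconds * clock_hz / adds;
}
```

With the question's numbers, `effective_bandwidth_gbs(25000000, 4096, t)` gives roughly 22.1, 20.5, 16.7 and 7.8 GB/s for t = 37, 40, 49 and 105 seconds. `cycles_per_element(25000000, 4096, 105.0, 3.0e9)` comes out at about 3.1 cycles per element for the baseline, which would fit a loop limited by the latency of a dependent chain of adds rather than by memory.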

You could also try running some memory benchmarks (lmbench, Sandra, ...) and see if they do better in the same environment.
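
If you want a quick sanity check before reaching for those tools, a crude read-bandwidth measurement can be sketched as below. It is not a substitute for a real benchmark: the struct and function names are hypothetical, the block split mirrors the "one block per thread" setup from the question, and the result only means something when the array is much larger than the last-level cache.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

struct BandwidthResult {
    double sum;       // total of all partial sums, returned so the work is not optimized away
    double gb_per_s;  // bytes read / elapsed time, in units of 1e9 bytes per second
};

// Sums `data` `passes` times using `threads` threads, one contiguous block each,
// and reports the read bandwidth that this corresponds to.
BandwidthResult measure_read_bandwidth(const std::vector<double>& data,
                                       unsigned threads, unsigned passes) {
    assert(threads > 0 && passes > 0);
    std::vector<double> partial(threads, 0.0);
    const std::size_t block = data.size() / threads;

    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        const std::size_t begin = t * block;
        // The last thread also takes any remainder elements.
        const std::size_t end = (t + 1 == threads) ? data.size() : begin + block;
        pool.emplace_back([&, t, begin, end] {
            double s = 0.0;
            for (unsigned p = 0; p < passes; ++p) {
                s += std::accumulate(data.begin() + begin, data.begin() + end, 0.0);
            }
            partial[t] = s;  // each thread writes only its own slot
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;

    const double total = std::accumulate(partial.begin(), partial.end(), 0.0);
    const double bytes = static_cast<double>(data.size()) * passes * sizeof(double);
    return {total, bytes / elapsed.count() / 1e9};
}
```

Calling it with a 25000000-element vector and 1, 2 and 4 threads should show whether the GB/s figure stops growing as threads are added, which is what you would expect if the memory bus is the limit. The timing includes thread start-up, so use enough passes for that cost to be negligible.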
