user877329

Reputation: 6200

Parallelization vs vectorization performance bottleneck: Do AVX and MT compete?

I tried to compute the sum of all elements in a large matrix. Here are the test cases:

  1. MT and AVX takes 37 s
  2. MT and no AVX takes 40 s
  3. AVX and no MT takes 49 s
  4. Neither AVX nor MT takes 105 s

In all cases, the CPU clock is fixed to 3.0 GHz (claimed by cpufreq-info):

current policy: frequency should be within 1.60 GHz and 3.40 GHz.
                The governor "userspace" may decide which speed to use
                 within this range.
current CPU frequency is 3.00 GHz.

The matrix has 25000000 elements of type double, each with value 1.0, and the sum is computed 4096 times in a loop. Without AVX, MT gives a speedup of 2.6 (105 s / 40 s). With AVX it is only 1.3 (49 s / 37 s). When running MT, the matrix is divided into 4 blocks, one per thread. If I reduce the CPU frequency, the MT improvement is larger for AVX, so there might also be some issue with cache misses, but that cannot explain the difference between (4)/(2) and (3)/(1). Do AVX and MT compete with each other in some way? The chip is an i5-3570K.
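The code itself is not shown in the question, so the following is only a minimal sketch of the setup described above: the array is split into one contiguous block per thread, each thread sums its own block, and the partial sums are added at the end. The names `sum_block` and `parallel_sum` are hypothetical. The inner loop is left as a plain scalar loop; whether a compiler turns it into AVX code depends on the build flags, and compilers usually need permission to reorder floating-point additions (for example GCC's `-ffast-math`) before they vectorize a reduction like this.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Sum one contiguous block. Hypothetical helper, not the asker's code.
double sum_block(const double* data, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += data[i];
    return s;
}

// Split the array into num_threads contiguous blocks, one per thread,
// and add up the per-thread partial sums after all threads have joined.
double parallel_sum(const std::vector<double>& v, unsigned num_threads) {
    if (num_threads == 0) num_threads = 1;
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    const std::size_t chunk = v.size() / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * chunk;
        // The last thread also takes the remainder of the array.
        const std::size_t end = (t + 1 == num_threads) ? v.size() : begin + chunk;
        workers.emplace_back([&v, &partial, t, begin, end] {
            partial[t] = sum_block(v.data() + begin, end - begin);
        });
    }
    for (auto& w : workers) w.join();
    double total = 0.0;
    for (double p : partial) total += p;
    return total;
}
```

Calling `parallel_sum(matrix, 4)` corresponds to the "4 blocks, one per thread" case, and `sum_block(matrix.data(), matrix.size())` to the single-threaded one.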

Upvotes: 3

Views: 641

Answers (2)

Salah Saleh

Reputation: 811

AVX should not compete with MT; they are two different things. The summation idea is simple, but depending on your implementation you can get very different numbers. I suggest you use the STREAM benchmark to test performance, as it is the standard. I can't see your code, but there are some things to check:

  1. You are initializing the matrix with 1.0 for all the elements. I think that is not a good idea. You should use random numbers or at least initialize based on the index (e.g. (i%10)/10.0).
  2. How do you measure time? You should place your timers outside the repetition loop and take the average over the number of repetitions. Also, do you use accurate timers?
  3. Did you make sure that your code is actually vectorized? Did you enable any compiler flags to display this information? Did you make sure that the AVX version of your code is used? Maybe the compiler chose the scalar version.
  4. You mentioned that the frequency is fixed. Are you sure that turbo mode is not enabled at any point?
  5. What about thread affinity when measuring with MT?
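Points 1 and 2 can be sketched roughly as follows. This is only an illustration under assumptions: `make_matrix`, `Timing` and `time_sum` are hypothetical names, and `std::chrono::steady_clock` is one reasonable choice of monotonic timer, not necessarily the one you should use on your platform. The timer is read once before and once after the whole repetition loop, and the average time per pass is reported.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Fill the array with index-dependent values instead of a constant 1.0.
std::vector<double> make_matrix(std::size_t n) {
    std::vector<double> v(n);
    for (std::size_t i = 0; i < n; ++i) v[i] = (i % 10) / 10.0;
    return v;
}

struct Timing {
    double checksum;        // accumulated result, returned so that it is actually used
    double seconds_per_rep; // average wall-clock time of one pass over the data
};

// Time `reps` passes over the data with the timer outside the repetition loop.
Timing time_sum(const std::vector<double>& v, int reps) {
    double checksum = 0.0;
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
        double s = 0.0;
        for (double x : v) s += x;
        checksum += s;
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = stop - start;
    return {checksum, elapsed.count() / reps};
}
```

An optimizing compiler may still notice that every pass computes the same value, so it is worth checking the generated code or the measured time for plausibility before trusting the number.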

Upvotes: 0

Leeor

Reputation: 19706

It's quite possible that your baseline performance was bounded by execution latency, but either form of parallelization (MT or vectorization) allowed you to break that and reach the next bottleneck, which is the memory BW of your CPU.

Check the peak BW your CPU can reach and compare it with your data. It looks like you're simply saturating at about 20.5 GB/s (25000000 elements * 4096 loops * 8 bytes, assuming that's what your system uses for double, / ~40 seconds). That seems a little low, as this link says it should reach 25 GB/s, but it is in the same ballpark, so the gap could be due to other inefficiencies: the type of DDR, other apps or the OS working in the background, frequency scaling done by the CPU to save power or reduce heat, etc.
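The arithmetic behind that estimate can be written as a small helper. This is just a sketch: `effective_bandwidth_gbs` is a hypothetical name, 1 GB is taken as 1e9 bytes, and it assumes every pass really streams the whole array from memory.

```cpp
#include <cstdint>

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) when `elements` values of
// `bytes_per_element` bytes each are read `passes` times in `seconds`.
double effective_bandwidth_gbs(std::uint64_t elements, std::uint64_t passes,
                               std::uint64_t bytes_per_element, double seconds) {
    const double bytes = static_cast<double>(elements) * passes * bytes_per_element;
    return bytes / seconds / 1e9;
}
```

With the numbers from the question, `effective_bandwidth_gbs(25000000, 4096, 8, 40.0)` gives about 20.5 GB/s for MT without AVX. The same formula gives about 22.1 GB/s for 37 s (MT and AVX), about 16.7 GB/s for 49 s (AVX only) and about 7.8 GB/s for 105 s (neither).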

You could also try running some memory benchmarks (lmbench, Sandra, ...) and see if they do better in the same environment.

Upvotes: 3
