Reputation: 6200
I tried to compute the sum of all elements in a large matrix. Here are the test cases:
In all cases, the CPU clock is fixed to 3.0 GHz (claimed by cpufreq-info):
current policy: frequency should be within 1.60 GHz and 3.40 GHz.
The governor "userspace" may decide which speed to use
within this range.
current CPU frequency is 3.00 GHz.
The matrix has 25000000 elements of type double
and value 1.0. And the sum is computed repeatedly 4096 times in a loop. Without AVX, the speed improvement when using MT is 2.6. With AVX it is only 1.3. When running MT, the matrix is divided into 4 blocks, one per thread. If I reduce the CPU frequency, the MT improvement is larger for AVX, so there might be some issue with cache misses also, but that cannot explain the difference between (4)/(2) and (3)/(1). Does AVX and MT compete with each other in some way? The chip is i3570K.
Upvotes: 3
Views: 641
Reputation: 811
MT should not compete with MT, they are two different things. Although the summation idea is simple but depending on your implementation you can get very different numbers. I suggest you use the Stream benchmarks to test performance as they are the standard. I don't see your code but there are some issues:
Upvotes: 0
Reputation: 19706
It's quite possible that your baseline performance was bounded by execution latency, but either form of parallelization (MT or vectorization) allowed you to break that and reach the next bottleneck which is the memory BW of your CPU.
Check the peak BW your CPU can reach and compare with your data, looks like you're simply saturating at 20.5GB/s (25000000 elements * 4096 loops * 8Bytes assuming that's what your system uses for double / ~40 seconds), which seems a little low as this link says it should reach 25GB/s, but around the same ballpark so it could be due to other inefficiencies, like type of DDR, other apps / OS working in the background, frequency scaling done by the CPU to save power / reduce heat, etc..
You could also try running some memory benchmarks (lmbench, sandra, ..) and see if they do better under the same environment.
Upvotes: 3