Reputation: 2767
Recently I ran into a problem with the Intel TBB scalable allocator. The basic usage pattern is as follows:
- Vectors of N * sizeof(double) bytes are allocated.
- A random M is chosen such that M >= N / 2 && M <= N.
- The first M elements of each vector are accessed.
I set M to be random because I did not want to benchmark performance for a fixed length. Instead, I want to get an average performance over a range of vector lengths.
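In code, the pattern is roughly the following (a minimal sketch assuming tbb::scalable_allocator and a placeholder computation; the real benchmark is not shown here):

    #include <cstddef>
    #include <random>
    #include <tbb/scalable_allocator.h>

    // Sketch of the benchmark pattern: allocate vectors of N doubles with the
    // TBB scalable allocator, then touch only the first M elements, with M
    // drawn uniformly from [N / 2, N].
    void run_once(std::size_t N, std::mt19937_64 &rng)
    {
        tbb::scalable_allocator<double> alloc;
        double *a = alloc.allocate(N);   // input vector
        double *b = alloc.allocate(N);   // output vector
        for (std::size_t i = 0; i != N; ++i)
            a[i] = 1.0;                  // fill the input

        std::uniform_int_distribution<std::size_t> dist(N / 2, N);
        const std::size_t M = dist(rng); // N / 2 <= M <= N

        for (std::size_t i = 0; i != M; ++i)
            b[i] = 2.0 * a[i];           // placeholder for the real computation

        alloc.deallocate(a, N);
        alloc.deallocate(b, N);
    }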
The program's performance differs greatly for different values of N. This is not uncommon, as the functions I am testing were designed to prioritize performance for large N. However, when I tried to benchmark the relation between performance and N, I found that at a certain point there is a twofold difference when N increases from 1016 to 1017.
My first instinct was that the performance degradation for N = 1016 has nothing to do with the smaller vector size; instead, it has something to do with the cache. Most likely there is false sharing. The function under test makes use of SIMD instructions but no stack memory. It reads one 32-byte element from the first vector and, after computation, writes 32 bytes to the second (and the third) vector. If false sharing occurs, a few dozen cycles are likely lost, and that is exactly the performance penalty I am observing. Some profiling confirms this.
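The memory traffic per iteration looks roughly like this (my own sketch of the access pattern; the actual kernel is not shown in the question):

    #include <cstddef>
    #include <immintrin.h>

    // One 32-byte (4-double) load from the input, one 32-byte store to each of
    // the two output vectors, no stack spills in between.
    void kernel(const double *in, double *out1, double *out2, std::size_t M)
    {
        for (std::size_t i = 0; i + 4 <= M; i += 4) {
            __m256d x = _mm256_load_pd(in + i);   // aligned 32-byte load
            __m256d y = _mm256_mul_pd(x, x);      // placeholder computation
            _mm256_store_pd(out1 + i, y);         // aligned 32-byte store
            _mm256_store_pd(out2 + i, y);
        }
    }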
Originally I aligned each vector on a 32-byte boundary, for AVX instructions. To solve the problem, I aligned the vectors on a 64-byte boundary; however, I still observed the same performance penalty. Aligning on a 128-byte boundary solves the problem.
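For illustration, the aligned allocation can be done along these lines (using _mm_malloc as one possible mechanism; the original code may obtain aligned memory differently):

    #include <cstddef>
    #include <immintrin.h>  // _mm_malloc / _mm_free

    // 32 bytes is enough for aligned AVX loads/stores, 64 bytes matches the
    // cache line size, yet only 128 bytes removed the penalty in this benchmark.
    double *alloc_vector(std::size_t N, std::size_t alignment)
    {
        return static_cast<double *>(_mm_malloc(N * sizeof(double), alignment));
    }

    // alloc_vector(N, 32);  alloc_vector(N, 64);  alloc_vector(N, 128);
    // release each pointer with _mm_free().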
I did some more digging. Intel TBB has a cache_aligned_allocator. In its source, memory is aligned on a 128-byte boundary as well.
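For reference, the allocator drops straight into a standard container (a minimal usage sketch, not code from the question):

    #include <vector>
    #include <tbb/cache_aligned_allocator.h>

    // Every allocation made through tbb::cache_aligned_allocator starts on a
    // boundary of at least one cache line; in the TBB sources that boundary is
    // 128 bytes.
    std::vector<double, tbb::cache_aligned_allocator<double>> v(1017);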
This is what I don't understand. If I am not mistaken, modern x86 CPUs have a cache line of 64 bytes, and CPUID confirms that. The following is the basic cache information of the CPU in use, extracted with a small program I wrote that uses CPUID to check features:
Vendor GenuineIntel
Brand Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
====================================================================================================
Deterministic Cache Parameters (EAX = 0x04, ECX = 0x00)
----------------------------------------------------------------------------------------------------
Cache level 1 1 2 3 4
Cache type Data Instruction Unified Unified Unified
Cache size (byte) 32K 32K 256K 6M 128M
Maximum Proc sharing 2 2 2 16 16
Maximum Proc physical 8 8 8 8 8
Coherency line size (byte) 64 64 64 64 64
Physical line partitions 1 1 1 1 16
Ways of associative 8 8 8 12 16
Number of sets 64 64 512 8192 8192
Self initializing Yes Yes Yes Yes Yes
Fully associative No No No No No
Write-back invalidate No No No No No
Cache inclusiveness No No No Yes No
Complex cache indexing No No No Yes Yes
----------------------------------------------------------------------------------------------------
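For completeness, a minimal sketch of how the line size and associativity above can be read from CPUID leaf 0x04 with the GCC/Clang __cpuid_count intrinsic (my own illustration, not the full program that produced this dump):

    #include <cpuid.h>
    #include <cstdio>

    int main()
    {
        // Enumerate the deterministic cache parameters (EAX = 0x04, ECX = subleaf).
        for (unsigned subleaf = 0; ; ++subleaf) {
            unsigned eax, ebx, ecx, edx;
            __cpuid_count(0x04, subleaf, eax, ebx, ecx, edx);

            unsigned type = eax & 0x1F;             // 0 means no more caches
            if (type == 0)
                break;

            unsigned level     = (eax >> 5) & 0x7;
            unsigned line_size = (ebx & 0xFFF) + 1; // coherency line size in bytes
            unsigned ways      = ((ebx >> 22) & 0x3FF) + 1;
            unsigned sets      = ecx + 1;

            std::printf("L%u: line %u B, %u-way, %u sets\n",
                        level, line_size, ways, sets);
        }
        return 0;
    }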
In addition, in Intel TBB's source, the 128-byte alignment is accompanied by a comment saying that it is for backward compatibility.
So why was 64-byte alignment not sufficient in my case?
Upvotes: 4
Views: 1564
Reputation: 16099
You are being hit by conflict misses. The reason it happens when you go from 1016 to 1017 is that you then start to use the last cache line in the associative set.
Your cache is 32K and 8-way, so each way covers 4K and the set index repeats every 4K of address space. Your cache line of 64 bytes can hold 8 doubles, so a vector of 1017-1024 doubles uses 8K, not 4K (1024 * sizeof(double) = 8192 bytes). Since you are using M in the range N/2 to N, you are (except when M is exactly N/2) using some of the same lower address-bit combinations twice within each vector.
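A quick back-of-the-envelope check of that arithmetic, using the L1d parameters from the CPUID dump above (my own illustration, not part of the original answer):

    #include <cstddef>
    #include <cstdio>

    int main()
    {
        const std::size_t line = 64;                    // bytes per cache line
        const std::size_t ways = 8;
        const std::size_t size = 32 * 1024;             // L1d size
        const std::size_t sets = size / (ways * line);  // 64 sets

        // Addresses whose offsets differ by sets * line = 4 KiB map to the same
        // set. A 1024-double vector is 8 KiB, so its second half reuses the same
        // 64 set indices as its first half.
        const std::size_t critical_stride = sets * line;            // 4096 bytes
        const std::size_t vector_bytes    = 1024 * sizeof(double);  // 8192 bytes

        std::printf("sets = %zu, critical stride = %zu B, vector = %zu B (%zu passes)\n",
                    sets, critical_stride, vector_bytes, vector_bytes / critical_stride);
        return 0;
    }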
You don't get the problem of conflict misses until you have used all of your L1 cache, which you are close to doing now. You are using one vector for reading and two vectors for writing, all 8K long, so 24K in total; if you use 8K or more of extra data during the computation, you get an increasing chance of evicting the data you need.
Mind you, you are only using the first part of the vectors, but they conflict nevertheless.
You will be able to observe this as an increase in L1-cache misses when you go from 1016 to 1017. When you exceed 1024 doubles the performance penalty should disappear for a short while, until you hit L1-cache capacity misses.
< imagine a graph here that shows a spike when all 8 ways of a set are used >
From the fantastic article by Ulrich Drepper: "Memory part 5: What programmers can do"
Upvotes: 4