Yan Zhou

Cache aligned memory allocation on Intel CPU

Recently I ran into a problem with the Intel TBB scalable allocator. The basic usage pattern is as follows:

  1. A few vectors of N * sizeof(double) bytes are allocated.
  2. Generate a random integer M such that M >= N / 2 && M <= N.
  3. The first M elements of each vector are accessed.
  4. Repeat steps 2 and 3 for 1000 times.

I set M to be random because I did not want to benchmark performance for a fixed length. Instead, I wanted an average performance over a range of vector lengths.

The program's performance differs greatly for different values of N. This is not uncommon, as the functions I am testing were designed to prioritize performance for large N. However, when I tried to benchmark the relation between performance and N, I found that at a certain point there is a twofold difference when N increases from 1016 to 1017.

My first instinct was that the performance difference between N = 1016 and N = 1017 had nothing to do with the vector size itself; instead, it had something to do with the cache. Most likely there is false sharing. The function under testing makes use of SIMD instructions but no stack memory. It reads one 32-byte element from the first vector and, after computation, writes 32 bytes to the second (and the third) vector. If false sharing occurs, a few dozen cycles are likely lost, and that is exactly the performance penalty I am observing. Some profiling confirms this.

Originally I aligned each vector on a 32-byte boundary, for AVX instructions. To solve the problem, I aligned the vectors on a 64-byte boundary instead. However, I still observed the same performance penalty. Aligning by 128 bytes solved the problem.

I did some more digging. Intel TBB has a cache_aligned_allocator. In its source, memory is aligned by 128 bytes as well.

This is what I don't understand. If I am not mistaken, modern x86 CPUs have a cache line of 64 bytes. CPUID confirms this. The following is the basic cache information of the CPU in use, extracted by a small program I wrote that uses CPUID to check features:

Vendor    GenuineIntel
Brand     Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz

====================================================================================================
Deterministic Cache Parameters (EAX = 0x04, ECX = 0x00)
----------------------------------------------------------------------------------------------------
Cache level                             1           1           2           3           4           
Cache type                              Data        Instruction Unified     Unified     Unified     
Cache size (byte)                       32K         32K         256K        6M          128M        
Maximum Proc sharing                    2           2           2           16          16          
Maximum Proc physical                   8           8           8           8           8           
Coherency line size (byte)              64          64          64          64          64          
Physical line partitions                1           1           1           1           16          
Ways of associative                     8           8           8           12          16          
Number of sets                          64          64          512         8192        8192        
Self initializing                       Yes         Yes         Yes         Yes         Yes         
Fully associative                       No          No          No          No          No          
Write-back invalidate                   No          No          No          No          No          
Cache inclusiveness                     No          No          No          Yes         No          
Complex cache indexing                  No          No          No          Yes         Yes         
----------------------------------------------------------------------------------------------------

In addition, in Intel TBB's source, the 128-byte alignment is marked by a comment saying that it is for backward compatibility.

So why was 64 bytes alignment not sufficient in my case?


Answers (1)

Surt

You are hit by conflict misses. The reason it happens when you go from 1016 to 1017 is that you then start to use the last cache line in the associative set.

Your cache is 32K and 8-way set associative, so addresses 4K apart map to the same set. A cache line of 64 bytes holds 8 doubles, but a vector of 1017-1024 doubles uses 8K, not 4K (1024 * sizeof(double)). Since you are using M in the range N/2 to N, you are using (except when M is exactly N/2) some of the same lower address bit combinations twice within each vector.

You don't get the problem of conflict misses until you have used all of your L1 cache, which you are close to doing now. You are using one vector for reading and two vectors for writing, each 8K long, so 24K in total; if you are using 8K+ of extra data during your computation, you get an increasing chance of evicting your chosen data.

Mind you, you are only using the first part of the vectors, but they conflict nevertheless.

You should be able to observe this as an increase in L1 cache misses when you go from 1016 to 1017. When you exceed 1024 doubles, the performance penalty should disappear for a short while, until you hit L1 cache capacity misses.

< imagine a graph here that shows a spike when all 8 ways of a set are used >

From the fantastic article by Ulrich Drepper: "Memory part 5: What programmers can do".
