Reputation: 2767
Recently I ran into a problem with the Intel TBB scalable allocator. The basic usage pattern is as follows:
- Vectors of N * sizeof(double) bytes are allocated.
- A random M is chosen such that M >= N / 2 && M <= N.
- The first M elements of each vector are accessed.
I set M to be random because I did not want to benchmark performance for a fixed length. Instead, I want to get an average performance over a range of vector lengths.
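In code, the pattern is roughly the following (a minimal sketch assuming tbb::scalable_allocator and a placeholder computation; the real benchmark is not shown here):

    #include <cstddef>
    #include <random>
    #include <tbb/scalable_allocator.h>

    // Sketch of the benchmark pattern: allocate vectors of N doubles with the
    // TBB scalable allocator, then touch only the first M elements, with M
    // drawn uniformly from [N / 2, N].
    void run_once(std::size_t N, std::mt19937_64 &rng)
    {
        tbb::scalable_allocator<double> alloc;
        double *a = alloc.allocate(N);   // input vector
        double *b = alloc.allocate(N);   // output vector
        for (std::size_t i = 0; i != N; ++i)
            a[i] = 1.0;                  // fill the input

        std::uniform_int_distribution<std::size_t> dist(N / 2, N);
        const std::size_t M = dist(rng); // N / 2 <= M <= N

        for (std::size_t i = 0; i != M; ++i)
            b[i] = 2.0 * a[i];           // placeholder for the real computation

        alloc.deallocate(a, N);
        alloc.deallocate(b, N);
    }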
The program's performance differs greatly for different values of N. This is not uncommon, as the functions I am testing were designed to prioritize performance for large N. However, when I tried to benchmark the relation between performance and N, I found that at a certain point there is a twofold difference when N increases from 1016 to 1017.
My first instinct was that the performance degradation for N = 1016 has nothing to do with the smaller vector size; instead, it has something to do with the cache. Most likely there is false sharing. The function under test makes use of SIMD instructions but no stack memory. It reads one 32-byte element from the first vector and, after computation, writes 32 bytes to the second (and the third) vector. If false sharing occurs, a few dozen cycles are likely lost, and that is exactly the performance penalty I am observing. Some profiling confirms this.
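The memory traffic per iteration looks roughly like this (my own sketch of the access pattern; the actual kernel is not shown in the question):

    #include <cstddef>
    #include <immintrin.h>

    // One 32-byte (4-double) load from the input, one 32-byte store to each of
    // the two output vectors, no stack spills in between.
    void kernel(const double *in, double *out1, double *out2, std::size_t M)
    {
        for (std::size_t i = 0; i + 4 <= M; i += 4) {
            __m256d x = _mm256_load_pd(in + i);   // aligned 32-byte load
            __m256d y = _mm256_mul_pd(x, x);      // placeholder computation
            _mm256_store_pd(out1 + i, y);         // aligned 32-byte store
            _mm256_store_pd(out2 + i, y);
        }
    }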
Originally I aligned each vector on a 32-byte boundary, for AVX instructions. To solve the problem, I aligned the vectors on a 64-byte boundary; however, I still observed the same performance penalty. Aligning on a 128-byte boundary solves the problem.
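For illustration, the aligned allocation can be done along these lines (using _mm_malloc as one possible mechanism; the original code may obtain aligned memory differently):

    #include <cstddef>
    #include <immintrin.h>  // _mm_malloc / _mm_free

    // 32 bytes is enough for aligned AVX loads/stores, 64 bytes matches the
    // cache line size, yet only 128 bytes removed the penalty in this benchmark.
    double *alloc_vector(std::size_t N, std::size_t alignment)
    {
        return static_cast<double *>(_mm_malloc(N * sizeof(double), alignment));
    }

    // alloc_vector(N, 32);  alloc_vector(N, 64);  alloc_vector(N, 128);
    // release each pointer with _mm_free().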
I did some more digging. Intel TBB has a cache_aligned_allocator. In its source, memory is aligned on a 128-byte boundary as well.
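For reference, the allocator drops straight into a standard container (a minimal usage sketch, not code from the question):

    #include <vector>
    #include <tbb/cache_aligned_allocator.h>

    // Every allocation made through tbb::cache_aligned_allocator starts on a
    // boundary of at least one cache line; in the TBB sources that boundary is
    // 128 bytes.
    std::vector<double, tbb::cache_aligned_allocator<double>> v(1017);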
This is what I don't understand. If I am not mistaken, modern x86 CPUs have a cache line of 64 bytes, and CPUID confirms that. The following is the basic cache information of the CPU in use, extracted with a small program I wrote that uses CPUID to check features:
Vendor GenuineIntel
Brand Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
====================================================================================================
Deterministic Cache Parameters (EAX = 0x04, ECX = 0x00)
----------------------------------------------------------------------------------------------------
Cache level 1 1 2 3 4
Cache type Data Instruction Unified Unified Unified
Cache size (byte) 32K 32K 256K 6M 128M
Maximum Proc sharing 2 2 2 16 16
Maximum Proc physical 8 8 8 8 8
Coherency line size (byte) 64 64 64 64 64
Physical line partitions 1 1 1 1 16
Ways of associative 8 8 8 12 16
Number of sets 64 64 512 8192 8192
Self initializing Yes Yes Yes Yes Yes
Fully associative No No No No No
Write-back invalidate No No No No No
Cache inclusiveness No No No Yes No
Complex cache indexing No No No Yes Yes
----------------------------------------------------------------------------------------------------
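For completeness, a minimal sketch of how the line size and associativity above can be read from CPUID leaf 0x04 with the GCC/Clang __cpuid_count intrinsic (my own illustration, not the full program that produced this dump):

    #include <cpuid.h>
    #include <cstdio>

    int main()
    {
        // Enumerate the deterministic cache parameters (EAX = 0x04, ECX = subleaf).
        for (unsigned subleaf = 0; ; ++subleaf) {
            unsigned eax, ebx, ecx, edx;
            __cpuid_count(0x04, subleaf, eax, ebx, ecx, edx);

            unsigned type = eax & 0x1F;             // 0 means no more caches
            if (type == 0)
                break;

            unsigned level     = (eax >> 5) & 0x7;
            unsigned line_size = (ebx & 0xFFF) + 1; // coherency line size in bytes
            unsigned ways      = ((ebx >> 22) & 0x3FF) + 1;
            unsigned sets      = ecx + 1;

            std::printf("L%u: line %u B, %u-way, %u sets\n",
                        level, line_size, ways, sets);
        }
        return 0;
    }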
In addition, in Intel TBB's source, the 128-byte alignment is accompanied by a comment saying that it is for backward compatibility.
So why was 64-byte alignment not sufficient in my case?
Upvotes: 4
Views: 1564
Reputation: 16099
You are being hit by conflict misses. The reason it happens when you go from 1016 to 1017 is that you then start to use the last cache line in the associative set.
Your cache is 32K and 8-way, so each way covers 4K and the set index repeats every 4K of address space. Your cache line of 64 bytes can hold 8 doubles, so a vector of 1017-1024 doubles uses 8K, not 4K (1024 * sizeof(double) = 8192 bytes). Since you are using M in the range N/2 to N, you are (except when M is exactly N/2) using some of the same lower address-bit combinations twice within each vector.
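A quick back-of-the-envelope check of that arithmetic, using the L1d parameters from the CPUID dump above (my own illustration, not part of the original answer):

    #include <cstddef>
    #include <cstdio>

    int main()
    {
        const std::size_t line = 64;                    // bytes per cache line
        const std::size_t ways = 8;
        const std::size_t size = 32 * 1024;             // L1d size
        const std::size_t sets = size / (ways * line);  // 64 sets

        // Addresses whose offsets differ by sets * line = 4 KiB map to the same
        // set. A 1024-double vector is 8 KiB, so its second half reuses the same
        // 64 set indices as its first half.
        const std::size_t critical_stride = sets * line;            // 4096 bytes
        const std::size_t vector_bytes    = 1024 * sizeof(double);  // 8192 bytes

        std::printf("sets = %zu, critical stride = %zu B, vector = %zu B (%zu passes)\n",
                    sets, critical_stride, vector_bytes, vector_bytes / critical_stride);
        return 0;
    }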
You don't get the problem of conflict misses until you have used all of your L1 cache, which you are close to doing now. You are using one vector for reading and two vectors for writing, all 8K long, so 24K in total; if you use 8K or more of extra data during the computation, you get an increasing chance of evicting the data you need.
Mind you, you are only using the first part of the vectors, but they conflict nevertheless.
You will be able to observe this as an increase in L1-cache misses when you go from 1016 to 1017. When you exceed 1024 doubles the performance penalty should disappear for a short while, until you hit L1-cache capacity misses.
< imagine a graph here that shows a spike when all 8 ways of a set are used >
From the fantastic article by Ulrich Drepper: "Memory part 5: What programmers can do"
Upvotes: 4