Data alignment to enable vectorization / efficient cache access

Question

This book says the following:

For Knights Landing, memory movement is optimal when the data starting address lies on 64-byte boundaries.

Q1. Is there a way to query the processor in C++ code dynamically to know what this optimal n-byte boundary would be for the processor on which the application is currently running? That way, the code would be portable.

The book further states:

As programmers, we end up with two jobs: (1) align our data and (2) make sure the compiler knows it is aligned.

(Suppose for the question below that we know that it is optimal for our processor to have data start at 64-byte boundaries.)

What exactly is this "data" though?

Suppose I have a class thus:

#include <vector>

class Class1_ {
private:
    int a;     // 4 bytes
    double b;  // 8 bytes
    std::vector<int> potentially_longish_vector_int;
    std::vector<double> potentially_longish_vector_double;
    double* potentially_longish_heap_array_double;  // heap array owned by the object
public:
    //--stuff---//
    double* return_heap_array_address() { return potentially_longish_heap_array_double; }
};

Suppose I also have functions that are prototyped thus:

void func1(Class1_& obj_class1);

void func2(double* array);

That is, func1 takes in an object of Class1_ by reference, and func2 is called as func2(obj_class1.return_heap_array_address());

To be consistent with the advice that data should be appropriately boundary aligned, should obj_class1 itself be 64-byte boundary aligned for efficient functioning of func1()? Should potentially_longish_heap_array_double be 64-byte boundary aligned for efficient functioning of func2()?

For alignment of other data members of the class which are STL containers, the thread here suggests how to go about accomplishing the required alignment.

Q2. So, does the object itself need to be appropriately aligned as well as all of the data members within it?

Maxim Egorushkin · Accepted Answer

In general, aligning your arrays on a cache line boundary maximises cache utilisation and also makes the arrays suitably aligned for any SIMD instructions. That is because the unit of transfer between RAM and CPU caches is a cache line, which is 64 bytes on modern Intel CPUs.
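As a rough sketch of what that can look like for a heap array such as potentially_longish_heap_array_double: the code below queries the cache line size at run time where the platform exposes it and then allocates with that alignment. It assumes C++17 (aligned operator new) and relies on `_SC_LEVEL1_DCACHE_LINESIZE`, which is a glibc extension to `sysconf`; the fallback of 64 and the names `cache_line_size`, `AlignedDelete`, `AlignedDoubles`, `make_aligned_doubles` and `is_aligned` are my own, not something from the book.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>
#if defined(__unix__) || defined(__APPLE__)
#include <unistd.h>  // sysconf
#endif

// Best-effort run-time query of the L1 data cache line size in bytes.
// _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension, hence the guard.
// The fallback of 64 is an assumption based on modern Intel CPUs.
std::size_t cache_line_size() {
#if defined(_SC_LEVEL1_DCACHE_LINESIZE)
    const long n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (n > 0) return static_cast<std::size_t>(n);
#endif
    return 64;
}

// Deleter that remembers the alignment used for the aligned operator new[].
struct AlignedDelete {
    std::size_t alignment;
    void operator()(double* p) const noexcept {
        ::operator delete[](p, std::align_val_t{alignment});
    }
};
using AlignedDoubles = std::unique_ptr<double[], AlignedDelete>;

// Allocate n zero-initialised doubles whose first element starts on an
// `alignment`-byte boundary. `alignment` must be a power of two.
AlignedDoubles make_aligned_doubles(std::size_t n, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    void* raw = ::operator new[](n * sizeof(double), std::align_val_t{alignment});
    double* p = static_cast<double*>(raw);
    std::uninitialized_fill_n(p, n, 0.0);
    return AlignedDoubles(p, AlignedDelete{alignment});
}

// True if p is a multiple of `alignment`.
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Usage would be something like `auto arr = make_aligned_doubles(n, cache_line_size());` followed by `func2(arr.get());`. Note that this only does the first of the two jobs from the book. Telling the compiler about the alignment needs a compile-time constant, for example `__builtin_assume_aligned(p, 64)` on GCC and Clang or `std::assume_aligned<64>(p)` in C++20, so a run-time value cannot be used for that part.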

However, increased alignment may also waste memory and reduce cache utilization. Normally only data structures on the critical fast path of your application may require specifying an increased alignment.
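To illustrate the memory cost, here is a small sketch with two hypothetical structs (`Plain` and `Padded` are made-up names) holding the same int and double as your class. Forcing 64-byte alignment on the type rounds its size up to 64 bytes, so an array of such objects carries mostly padding.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical small record with natural alignment.
struct Plain {
    int a;
    double b;
};

// Same members, but every object must start on a 64-byte boundary.
struct alignas(64) Padded {
    int a;
    double b;
};

// sizeof is always a multiple of the alignment, so 12 bytes of payload
// become a full 64-byte object.
static_assert(alignof(Padded) == 64, "Padded should be 64-byte aligned");
static_assert(sizeof(Padded) == 64, "Padded should occupy one full cache line");
static_assert(sizeof(Plain) < sizeof(Padded), "Plain should be smaller than Padded");

// Bytes occupied by an array of n elements of type T.
template <class T>
constexpr std::size_t array_bytes(std::size_t n) {
    return n * sizeof(T);
}
```

With these definitions, `array_bytes<Padded>(1000)` is 64000 bytes, while the same array of `Plain` is several times smaller, so fewer useful elements fit in the cache when every small object is over-aligned.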

It makes sense to arrange the members of your classes in {hotness, size} order, so that the most frequently accessed members, or members that are accessed together, reside on the same cache line.
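A minimal sketch of that ordering, assuming a 64-byte cache line. The `Particle` record, its members and the hot/cold split are invented for illustration; which members are actually hot is something only profiling your own code can tell you.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed cache line size in bytes

// Hypothetical record: hot members first, cold members last.
struct alignas(kCacheLine) Particle {
    // Hot: read and written on every simulation step.
    double x, y, z;
    double vx, vy, vz;
    std::uint32_t cell;
    std::uint32_t flags;
    // Cold: only touched when logging or debugging.
    std::uint64_t created_at;
    char label[48];
};

// The last hot member must end within the first cache line of the object.
static_assert(offsetof(Particle, flags) + sizeof(std::uint32_t) <= kCacheLine,
              "hot members spill onto a second cache line");

constexpr bool hot_members_share_a_line() {
    return offsetof(Particle, flags) + sizeof(std::uint32_t) <= kCacheLine;
}

// Ordering by size also matters: the same three members in two layouts.
struct Loose { char c1; double d; char c2; };  // padding after c1 and after c2
struct Tight { double d; char c1; char c2; };  // padding only at the end
static_assert(sizeof(Tight) <= sizeof(Loose), "size ordering should not add padding");
```

The static_assert turns the layout intent into a compile-time check, so adding a new hot member that pushes the group past the first cache line fails the build instead of silently costing an extra cache miss.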

The optimization objective here is to reduce cache and TLB misses (that is, to decrease cycles-per-instruction or, equivalently, increase instructions-per-cycle). TLB misses can be reduced by using huge pages.
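One way to ask for huge pages on Linux is transparent huge pages via `madvise` with `MADV_HUGEPAGE`. The sketch below assumes a 2 MiB huge page size (the usual x86-64 value) and C++17 aligned operator new; `HugeBuffer`, `allocate_huge` and `release_huge` are hypothetical names. The call is only a hint: the kernel may ignore it, and on other platforms the code falls back to a plain aligned allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#if defined(__linux__)
#include <sys/mman.h>  // madvise, MADV_HUGEPAGE
#endif

// Assumed huge page size: 2 MiB, the common value on x86-64 Linux.
constexpr std::size_t kHugePage = std::size_t{2} * 1024 * 1024;

struct HugeBuffer {
    void* data;         // start of the buffer, kHugePage-aligned
    std::size_t bytes;  // size, rounded up to a multiple of kHugePage
    bool advised;       // true if the kernel accepted the huge page hint
};

// Allocate at least `bytes` bytes aligned to the huge page size and, on
// Linux, hint that the range should be backed by transparent huge pages.
HugeBuffer allocate_huge(std::size_t bytes) {
    const std::size_t rounded = (bytes + kHugePage - 1) / kHugePage * kHugePage;
    void* p = ::operator new(rounded, std::align_val_t{kHugePage});
    bool advised = false;
#if defined(__linux__) && defined(MADV_HUGEPAGE)
    advised = (madvise(p, rounded, MADV_HUGEPAGE) == 0);
#endif
    return HugeBuffer{p, rounded, advised};
}

// Release a buffer obtained from allocate_huge.
void release_huge(HugeBuffer& buf) {
    ::operator delete(buf.data, std::align_val_t{kHugePage});
    buf.data = nullptr;
    buf.bytes = 0;
    buf.advised = false;
}
```

Whether this actually reduces TLB misses depends on the system's transparent huge page settings and on the access pattern, so it is worth measuring before and after rather than assuming a gain.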
