Reputation: 4050
This book says the following:
For Knights Landing, memory movement is optimal when the data starting address lies on 64-byte boundaries.
Q1. Is there a way to query the processor dynamically in C++ code to know what this optimal n-byte boundary would be for the processor on which the application is currently running? That way, the code would be portable.
The book further states:
As programmers, we end up with two jobs: (1) align our data and (2) make sure the compiler knows it is aligned.
(Suppose for the question below that we know that it is optimal for our processor to have data start at 64-byte boundaries.)
What exactly is this "data" though?
Suppose I have a class thus:
class Class1_ {
private:
    int a;       // 4 bytes
    double b;    // 8 bytes
    std::vector<int> potentially_longish_vector_int;
    std::vector<double> potentially_longish_vector_double;
    double* potentially_longish_heap_array_double;
public:
    //--stuff---//
    double* return_heap_array_address() { return potentially_longish_heap_array_double; }
};
Suppose I also have functions that are prototyped thus:
void func1(Class1_& obj_class1);
void func2(double* array);
That is, func1 takes an object of Class1_ by reference, and func2 is called as func2(obj_class1.return_heap_array_address());
To be consistent with the advice that data should be appropriately boundary aligned, should obj_class1 itself be 64-byte boundary aligned for efficient functioning of func1()? Should potentially_longish_heap_array_double be 64-byte boundary aligned for efficient functioning of func2()?
For aligning other data members of the class that are STL containers, the thread here suggests how to accomplish the required alignment.
Q2. So, does the object itself need to be appropriately aligned as well as all of the data members within it?
Upvotes: 1
Views: 409
Reputation: 136256
In general, aligning your arrays on a cache-line boundary maximises cache utilisation and also makes the arrays suitably aligned for any SIMD instructions. That is because the unit of transfer between RAM and the CPU caches is a cache line, which is 64 bytes on modern Intel CPUs.
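As a minimal sketch of both jobs the book mentions (allocating on a 64-byte boundary and telling the compiler about it), assuming C++17 for std::aligned_alloc; the helper names are illustrative, the __builtin_assume_aligned call is a GCC/Clang extension (C++20 offers std::assume_aligned), and C++17's std::hardware_destructive_interference_size gives a portable compile-time hint for the cache-line figure, though compiler support for it is uneven:

#include <cstddef>
#include <cstdlib>   // std::aligned_alloc, std::free

// Illustrative helper: allocate n doubles starting on a 64-byte boundary.
double* make_aligned_array(std::size_t n)
{
    constexpr std::size_t alignment = 64;  // cache-line size on modern Intel CPUs
    // std::aligned_alloc requires the allocation size to be a multiple of the alignment.
    std::size_t bytes = ((n * sizeof(double) + alignment - 1) / alignment) * alignment;
    return static_cast<double*>(std::aligned_alloc(alignment, bytes));
}

double sum(const double* array, std::size_t n)
{
    // Tell the compiler the pointer is 64-byte aligned so it may emit aligned SIMD loads.
    const double* p = static_cast<const double*>(__builtin_assume_aligned(array, 64));
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        s += p[i];
    return s;
}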
However, increased alignment may also waste memory in padding and thereby reduce cache utilisation. Normally, only data structures on the critical fast path of your application warrant specifying an increased alignment.
It makes sense to arrange the members of your classes in {hotness, size} order, so that the most frequently accessed members, and members accessed together, reside on the same cache line.
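As an illustrative sketch (the Particle type and its members are hypothetical, not from the question), keeping the hot, frequently co-accessed members together at the front lets them share one 64-byte cache line, while the cold members are pushed to the end:

#include <cstdint>
#include <string>
#include <vector>

// Hot members first, so the fields touched on every iteration share a cache line.
struct alignas(64) Particle {
    // --- hot: read/written every step ---
    double x, y, z;          // 24 bytes
    double vx, vy, vz;       // 24 bytes
    std::uint32_t flags;     //  4 bytes (hot region fits within 64 bytes)
    // --- cold: touched rarely ---
    std::string debug_name;
    std::vector<double> history;
};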
The optimization objective here is to reduce cache and TLB misses (or, equivalently, to decrease cycles-per-instruction / increase instructions-per-cycle). TLB misses can be reduced by using huge pages.
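A Linux-specific sketch of the huge-page suggestion (the mapping size is an arbitrary example): ask the kernel to back a large allocation with transparent huge pages via madvise, so that a single TLB entry can cover 2 MiB rather than 4 KiB:

#include <sys/mman.h>   // mmap, madvise (Linux / POSIX)
#include <cstddef>
#include <cstdio>

int main()
{
    constexpr std::size_t size = 64 * 1024 * 1024;   // 64 MiB working set (illustrative)

    // Anonymous mapping; page-aligned by construction.
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { std::perror("mmap"); return 1; }

    // Ask the kernel to use transparent huge pages for this range.
    if (madvise(p, size, MADV_HUGEPAGE) != 0)
        std::perror("madvise");                       // non-fatal: falls back to normal pages

    // ... use p as a large array here ...

    munmap(p, size);
    return 0;
}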
Upvotes: 5