Reputation: 449
There are many great threads on how to align structs to the cache line (e.g., Aligning to cache line and knowing the cache line size).
Imagine you have a system with 256B cache line size, and a struct of size 17B (e.g., a tightly packed struct with two uint64_t
and one uint8_t
). If you align the struct to cache line size, you will have exactly one cache line load per struct instance.
For machines with a cache line size of 32B or maybe even 64B, this will be good for performance, because we avoid having to fetch 2 caches lines as we do definitely not cross CL boundaries.
However, on the 256B machine, this wastes lots of memory and results in unnecessary loads when iterating through an array/vector of this struct. In fact, you could store 15 instances of the struct in a single cacheline.
My question is two-fold:
alignas
, I can align to cache line size. However, it is unclear to me how I can force alignment in a way that is similar to "put as many instances in a cache line as possible without crossing the cache line boundary, then start at the next cache line". So something like this:where the upper box is a cache line and the other boxes are instances of our small struct.
Thank you so much for your input here. It is much appreciated.
Upvotes: 1
Views: 2981
Reputation: 356
To address (2), it is unclear whether the extra overhead of using packed structs (e.g., unaligned 64-bit accesses) and the extra math to access array elements will be worth it. But if you want to try it, you can create a new struct to pack your struct elements appropriately, then create a small wrapper class to access the elements like you would an array:
#include <array>
#include <iostream>
using namespace std;
template <typename T, size_t BlockAlignment>
struct __attribute__((packed)) Packer
{
static constexpr size_t NUM_ELEMS = BlockAlignment / sizeof(T);
static_assert( NUM_ELEMS > 0, "BlockAlignment too small for one object." );
T &operator[]( size_t index ) { return packed[index]; }
T packed[ NUM_ELEMS ];
uint8_t padding[ BlockAlignment - sizeof(T)*NUM_ELEMS ];
};
template <typename T, size_t NumElements, size_t BlockAlignment>
struct alignas(BlockAlignment) PackedAlignedArray
{
typedef Packer<T, BlockAlignment> PackerType;
std::array< PackerType, NumElements / PackerType::NUM_ELEMS + 1 > packers;
T &operator[]( size_t index ) {
return packers[ index / PackerType::NUM_ELEMS ][ index % PackerType::NUM_ELEMS ];
}
};
struct __attribute__((packed)) Foo
{
uint64_t a;
uint64_t b;
uint8_t c;
};
int main()
{
static_assert( sizeof(Foo) == 17, "Struct not packed for test" );
constexpr size_t NUM_ELEMENTS = 10;
constexpr size_t BLOCK_ALIGNMENT = 64;
PackedAlignedArray<Foo, NUM_ELEMENTS, BLOCK_ALIGNMENT> theArray;
for ( size_t i=0; i<NUM_ELEMENTS; ++i )
{
// Display the memory offset between the current
// element and the start of the array
cout << reinterpret_cast<std::ptrdiff_t>(&theArray[i]) -
reinterpret_cast<std::ptrdiff_t>(&theArray[0]) << std::endl;
}
return 0;
}
The output of the program shows the byte offsets of the addresses in memory of the the 17-byte elements, automatically resetting to a multiple of 64 every four elements:
0
17
34
64
81
98
128
145
162
192
Upvotes: 2
Reputation: 364210
You could pack into a cache line by declaring the struct itself as under-aligned, with GNU C __attribute__((packed))
or something, so it has sizeof(struct) = 17 instead of the usual padding you'd get to make the struct size a multiple of 8 bytes (the alignment it would normally have because of having a uint64_t member, assuming alignof(uint64_t) == 8).
Then put it in an alignas(256) T array[]
, so only the start of the array is aligned.
Alignment to boundaries wider than the struct object itself is only possible in terms of a larger object containing multiple structs; ISO C++'s alignment system can't specify that an object can only go in containers which start at a certain alignment boundary; that would lead to nonsensical situations like T *arr
being the start of a valid array, but arr + 1
no being a valid array, even though it has the same type.
Unless you're worried about false sharing between two threads, you should at most naturally-align your struct, e.g. to 32 bytes. A naturally-aligned object (alignof=sizeof) can't span an alignment boundary larger than itself. But that wastes a lot of space, so probably better to just let the compiler align it by 8, inheriting alignof(struct)
= max (alignof(members))
.
See also How do I organize members in a struct to waste the least space on alignment? for more detail on how space inside a single struct works.
Depending on how your data is accessed, one option would be to store pairs of uint64_t
in one array, and the corresponding byte in a separate array. That way you have no padding bytes and everything is a power of 2 size and alignment. But random access to all 3 member that go with each other could cost 2 cache misses instead of 1.
But for iterating over your data sequentially, that's excellent.
Upvotes: 1