Maxbit
Maxbit

Reputation: 449

Align struct while minimizing cache line waste

There are many great threads on how to align structs to the cache line (e.g., Aligning to cache line and knowing the cache line size).

Imagine you have a system with 256B cache line size, and a struct of size 17B (e.g., a tightly packed struct with two uint64_t and one uint8_t). If you align the struct to cache line size, you will have exactly one cache line load per struct instance.

For machines with a cache line size of 32B or maybe even 64B, this will be good for performance, because we avoid having to fetch 2 caches lines as we do definitely not cross CL boundaries.

However, on the 256B machine, this wastes lots of memory and results in unnecessary loads when iterating through an array/vector of this struct. In fact, you could store 15 instances of the struct in a single cacheline.

My question is two-fold:

  1. In C++17 and above, using alignas, I can align to cache line size. However, it is unclear to me how I can force alignment in a way that is similar to "put as many instances in a cache line as possible without crossing the cache line boundary, then start at the next cache line". So something like this:

enter image description here

where the upper box is a cache line and the other boxes are instances of our small struct.

  1. Do I actually want this? I cannot really wrap my head around this. Usually, we say if we align our struct to the cache line size, access will be faster, as we just have to load a single cache line. However, seeing my example, I wonder if this is actually true. Wouldn't it be faster to not be aligned, but instead store many more instances in a single cache line?

Thank you so much for your input here. It is much appreciated.

Upvotes: 1

Views: 2981

Answers (2)

Charles Savoie
Charles Savoie

Reputation: 356

To address (2), it is unclear whether the extra overhead of using packed structs (e.g., unaligned 64-bit accesses) and the extra math to access array elements will be worth it. But if you want to try it, you can create a new struct to pack your struct elements appropriately, then create a small wrapper class to access the elements like you would an array:

    #include <array>
    #include <iostream>
    
    using namespace std;
    
    template <typename T, size_t BlockAlignment>
    struct __attribute__((packed)) Packer
    {
        static constexpr size_t NUM_ELEMS = BlockAlignment / sizeof(T);
        static_assert( NUM_ELEMS > 0, "BlockAlignment too small for one object." );
        T &operator[]( size_t index ) { return packed[index]; }
        
        T packed[ NUM_ELEMS ];
        uint8_t padding[ BlockAlignment - sizeof(T)*NUM_ELEMS ];
    };
    
    template <typename T, size_t NumElements, size_t BlockAlignment>
    struct alignas(BlockAlignment) PackedAlignedArray
    {
        typedef Packer<T, BlockAlignment> PackerType;
        std::array< PackerType, NumElements / PackerType::NUM_ELEMS + 1 > packers;
        T &operator[]( size_t index ) { 
            return packers[ index / PackerType::NUM_ELEMS ][ index % PackerType::NUM_ELEMS ];
        }
    };
    
    struct __attribute__((packed)) Foo
    {
        uint64_t a;
        uint64_t b;
        uint8_t c;
    };
    
    int main()
    {
        static_assert( sizeof(Foo) == 17, "Struct not packed for test" );
        constexpr size_t NUM_ELEMENTS = 10;
        constexpr size_t BLOCK_ALIGNMENT = 64;

        PackedAlignedArray<Foo, NUM_ELEMENTS, BLOCK_ALIGNMENT> theArray;

        for ( size_t i=0; i<NUM_ELEMENTS; ++i )
        {
            // Display the memory offset between the current
            // element and the start of the array
            cout << reinterpret_cast<std::ptrdiff_t>(&theArray[i]) - 
                reinterpret_cast<std::ptrdiff_t>(&theArray[0]) << std::endl;
        }
    
        return 0;
    }

The output of the program shows the byte offsets of the addresses in memory of the the 17-byte elements, automatically resetting to a multiple of 64 every four elements:

0
17
34
64
81
98
128
145
162
192

Upvotes: 2

Peter Cordes
Peter Cordes

Reputation: 364210

You could pack into a cache line by declaring the struct itself as under-aligned, with GNU C __attribute__((packed)) or something, so it has sizeof(struct) = 17 instead of the usual padding you'd get to make the struct size a multiple of 8 bytes (the alignment it would normally have because of having a uint64_t member, assuming alignof(uint64_t) == 8).

Then put it in an alignas(256) T array[], so only the start of the array is aligned.

Alignment to boundaries wider than the struct object itself is only possible in terms of a larger object containing multiple structs; ISO C++'s alignment system can't specify that an object can only go in containers which start at a certain alignment boundary; that would lead to nonsensical situations like T *arr being the start of a valid array, but arr + 1 no being a valid array, even though it has the same type.


Unless you're worried about false sharing between two threads, you should at most naturally-align your struct, e.g. to 32 bytes. A naturally-aligned object (alignof=sizeof) can't span an alignment boundary larger than itself. But that wastes a lot of space, so probably better to just let the compiler align it by 8, inheriting alignof(struct) = max (alignof(members)).

See also How do I organize members in a struct to waste the least space on alignment? for more detail on how space inside a single struct works.


Depending on how your data is accessed, one option would be to store pairs of uint64_t in one array, and the corresponding byte in a separate array. That way you have no padding bytes and everything is a power of 2 size and alignment. But random access to all 3 member that go with each other could cost 2 cache misses instead of 1.

But for iterating over your data sequentially, that's excellent.

Upvotes: 1

Related Questions