Reputation: 321
I was reading Game Coding Complete, 4th edition, and came across a topic on memory alignment. In the code below, the author says the first struct is really slow because it is neither bit-aligned nor byte-aligned. The second one is byte-aligned but not bit-aligned. The last one is fast because it is both. He says that without the pragma, the compiler will align the memory itself, which wastes memory. I couldn't really follow the calculations.
Here is the relevant portion from the text:
If the compiler were left to optimize SlowStruct by adding unused bytes, each structure would be 24 bytes instead of just 14. Seven extra bytes are padded after the first char variable, and the remaining bytes are added at the end. This ensures that the entire structure always starts on an 8-byte boundary. That’s about 40 percent of wasted space, all due to a careless ordering of member variables.
This is the concluding line, in bold:
Don’t let the compiler waste precious memory space. Put some of your brain cells to work and align your own member variables.
Please show me the calculations and explain the padding concept more clearly.
Code:
#pragma pack(push, 1)
struct ReallySlowStruct
{
    char c : 6;
    __int64 d : 64;
    int b : 32;
    char a : 8;
};

struct SlowStruct
{
    char c;
    __int64 d;
    int b;
    char a;
};

struct FastStruct
{
    __int64 d;
    int b;
    char a;
    char c;
    char unused[2];
};
#pragma pack(pop)
Upvotes: 30
Views: 50134
Reputation: 42716
As a correction to the accepted answer, writing SlowStruct::d will NOT happen one byte at a time on x86 or most other instruction sets, because the MOV instruction is perfectly happy to operate on an unaligned address. There's no difference in the assembly code.
Compiling this on either Clang-19.1.0 or GCC 14.2:
void Foo(SlowStruct* s) {
    s->d *= 9919;
}
Gives me this (omitting the surrounding fn call boilerplate):
imul rcx,QWORD PTR [rax+0x1],0x26bf
mov  QWORD PTR [rax+0x1],rcx
On really old chips there might have been a performance penalty when operating on addresses with a smaller alignment than the access size, but modern chips rarely seem to have an instruction-level penalty for misalignment. The main cost comes from cases where the data you are accessing straddles more than one cache line, as this causes the CPU to perform two cache transactions, and potentially 2 x 64-byte line fills from DRAM for a single 2/4/8-byte value (a big cost when it happens).
With the SIMD instruction sets (MMX/AVX/etc.), even though there are both aligned and unaligned instructions, such as VMOVAPS / VMOVUPS, the distinction is largely vestigial because the unaligned instruction is the same speed as the aligned instruction when operating on aligned addresses. Thus, the only reason to use the aligned instruction at all is if you want to deliberately crash on unaligned addresses as a debugging technique.
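To make the straddling condition concrete, here's a minimal sketch (assuming 64-byte cache lines, which is typical on x86-64; the helper name is my own):

#include <cstddef>
#include <cstdint>

// An access straddles a cache line iff its first and last bytes fall in
// different lines. (Hypothetical helper, assuming 64-byte lines.)
constexpr bool straddles_cache_line(std::uintptr_t addr, std::size_t size,
                                    std::size_t line = 64) {
    return (addr / line) != ((addr + size - 1) / line);
}

static_assert(!straddles_cache_line(0, 8));   // bytes 0..7: one line, one fill
static_assert(straddles_cache_line(60, 8));   // bytes 60..67: two lines, two fills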
The code example in this question is very strange, because it wildly over-complicates the optimal solution. Delete the #pragma lines that tell the compiler to disable struct padding optimizations, and this simple, direct declaration has an identical layout to FastStruct:
struct BetterFastStruct {
    std::int64_t d;
    int b;
    char a;
    char c;
};
Without the pragma, the compiler automatically inserts the two extra padding bytes without you having to maintain that arithmetic. The exact layout is technically implementation-defined, but every mainstream ABI pads this way, because each member must satisfy its alignment requirement; I can't imagine a compiler not doing this.
Coding it the way your book suggests just makes your code harder to read, for no benefit. It seems like programming malpractice to apply packed to a struct and then explicitly add back the same padding the compiler would have maintained automatically. For example, if you later edit this code and add an extra char field, the compiler-maintained padding will do the right thing and shrink to 1 byte so the struct is still a nicely aligned 16 bytes; but if you forget to update your manually maintained padding, you get a pointlessly bloated 17-byte struct that is strictly worse in every respect. Why take on that burden for no upside?
The reason to apply packed is that you want to compress an array of O(N) structs into fewer bytes, and you are willing to pay for that with reduced random-access efficiency, because misalignment causes some of the structs to straddle cache lines.
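For a concrete sense of that trade-off, a sketch (sizes assume a typical 64-bit ABI on GCC/Clang; the struct names are mine):

#include <cstdint>

struct Unpacked {                     // laid out by the compiler
    std::int64_t d;
    int b;
    char a;
};                                    // 13 bytes of data + 3 bytes tail padding

struct __attribute__((__packed__)) Packed {
    std::int64_t d;
    int b;
    char a;
};                                    // exactly 13 bytes, no padding

static_assert(sizeof(Unpacked) == 16);
static_assert(sizeof(Packed) == 13);
// An array of 1,000,000 elements: ~16 MB unpacked vs ~13 MB packed, at the
// cost of some packed elements straddling cache lines.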
Before I explain the nuanced answer, I want to push back on a second maintainability problem in that code.
Applying #pragma pack to an entire span of code including multiple structs is a TERRIBLE PRACTICE. I missed the first pragma line while typing up my initial answer, and it's inevitable that if there are 2+ people working on the project, someone will thoughtlessly add a new struct inside the span covered by that pragma without realizing the implications of it being there. Even with a single developer, it's easy to forget what you did 6 months ago. That push/pop pattern is cancerous for maintainability.
In general, whenever you are thwarting compiler optimizations for performance reasons, you should be screaming it from the rooftops, on every struct, and including clear comments explaining your rationale. I strongly suggest preferring the more maintainable attribute pattern over that pragma push/pop:
// PERFORMANCE NOTE: we want this structure to be packed because X, Y, and Z,
// and this improves our blah_blah_threaded benchmark by +15%.
struct __attribute__ ((__packed__)) MyPackedStruct {
...
};
Feel free to clean up the attribute syntax with a macro definition:
#define MYPROJ_PACKED __attribute__ ((__packed__))
// PERFORMANCE NOTE: we want this structure to be packed because X, Y, and Z,
// and this improves our blah_blah_threaded benchmark by +15%.
struct MYPROJ_PACKED MyPackedStruct {
...
};
Both #pragma and __attribute__ are formally compiler-specific, "non-standard" mechanisms, so some attributes and pragmas only exist on one compiler, or have differing names between GCC and Clang; but this attribute doesn't seem to be one of those cases. Both the attribute and the pragma forms worked on Clang and GCC going back many versions, until the compilers are so old that something else is missing, like std::int64_t (tested via Godbolt). So you can use them if you know what you are doing.
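If you also need MSVC, one hedged way to wrap both mechanisms in a macro (the macro names are my own invention; MSVC has no __attribute__, but it does have the __pragma operator):

#include <cstdint>

#if defined(_MSC_VER)
#define MYPROJ_PACK_BEGIN __pragma(pack(push, 1))
#define MYPROJ_PACKED     /* packing handled by the pragma pair */
#define MYPROJ_PACK_END   __pragma(pack(pop))
#else  // GCC / Clang
#define MYPROJ_PACK_BEGIN
#define MYPROJ_PACKED     __attribute__((__packed__))
#define MYPROJ_PACK_END
#endif

// Usage: the begin/end pair is a no-op on GCC/Clang, and the attribute is a
// no-op on MSVC, so each struct is packed exactly once on every compiler.
MYPROJ_PACK_BEGIN
struct MYPROJ_PACKED MyPackedStruct {
    std::int64_t d;  // 8 bytes
    char a;          // 1 byte; sizeof == 9 when packed on both compilers
};
MYPROJ_PACK_END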
OK, the nuanced answer.
The differences between SlowStruct and FastStruct are a bit subtle and depend on whether or not you are applying the pragma.
With the pragma, the main difference is the struct sizes:
// With the #pragma
sizeof(SlowStruct) == 14
sizeof(FastStruct) == 16
The pragma causes SlowStruct to be 14 bytes, and the 64-byte cache line size isn't evenly divisible by 14, so a std::vector<SlowStruct> will be laid out such that roughly every 5th instance straddles a cache-line boundary, which can double the cost of random uncached accesses to those instances.
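You can check the "roughly every 5th" figure with a quick sketch (assuming the buffer happens to start on a cache-line boundary; other starting offsets give a similar ratio):

#include <cstddef>

// Counts how many elem-byte elements straddle a 64-byte line boundary over
// one full repeat period: lcm(14, 64) = 448 bytes = 32 elements = 7 lines.
constexpr int count_straddlers(std::size_t elem = 14, std::size_t line = 64,
                               int n = 32) {
    int straddlers = 0;
    for (int i = 0; i < n; ++i) {
        std::size_t first = elem * static_cast<std::size_t>(i);
        std::size_t last = first + elem - 1;
        if (first / line != last / line) ++straddlers;
    }
    return straddlers;
}

static_assert(count_straddlers() == 6);  // 6 of 32 elements, i.e. about every 5th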
Given that the entire struct has alignment 2 (we can ignore the factor of 7), it hardly matters that some of its 4- and 8-byte fields are unaligned relative to the struct boundary. Even nominally aligned 4- and 8-byte fields in that struct would only have alignment 2 whenever SlowStruct instances are stored in an array. And again, alignment mostly only matters when it causes data to straddle cache lines.
The pointer arithmetic to multiply by 14 is marginally more expensive than the simple bit-shift needed to multiply by 16, but this won't be a material effect in most cases. Certainly not in the cases where we need to care about the alignment increasing cache misses.
Removing the pragma, the main differences are still the struct sizes:
// Without the #pragma
sizeof(SlowStruct) == 24
sizeof(FastStruct) == 16
This:
struct SlowStruct {
    char c;
    std::int64_t d;
    int b;
    char a;
};
... is automatically padded to this:
struct SlowStruct_as_compiled {
    char c;
    char _padding0[7];
    std::int64_t d;
    int b;
    char a;
    char _padding1[3];
};
static_assert(sizeof(SlowStruct) == sizeof(SlowStruct_as_compiled));
I just validated this on Godbolt: the static_assert passes with both GCC and Clang, and I'd be surprised to find a compiler where it fails. The language spec doesn't dictate the exact padding, but it does require each member to meet its alignment requirement, which forces this layout on any typical 64-bit ABI.
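The same layout can be spelled out with offsetof (these offsets are what GCC and Clang produce for the typical 64-bit ABI):

#include <cstddef>
#include <cstdint>

struct SlowStruct {
    char c;
    std::int64_t d;
    int b;
    char a;
};

static_assert(offsetof(SlowStruct, c) == 0);
static_assert(offsetof(SlowStruct, d) == 8);   // 7 bytes of padding after c
static_assert(offsetof(SlowStruct, b) == 16);
static_assert(offsetof(SlowStruct, a) == 20);
static_assert(sizeof(SlowStruct) == 24);       // 3 bytes of tail padding
static_assert(alignof(SlowStruct) == 8);       // alignment of the widest member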
The fields in SlowStruct are now aligned such that no single field access will ever straddle a cache line boundary, even when I've got an array of SlowStruct instances.
But SlowStruct still might be slower, because (a) 24 bytes is bulkier than 16 bytes, so the data can fall out of cache more easily, and (b) 64 isn't evenly divisible by 24, so a 24-byte struct can straddle two cache lines, costing 2 * 64 bytes to read, whereas a 16-byte struct is nearly always aligned such that it's exactly one 64-byte line fill from DRAM (you have to work hard to make this not true). FWIW, this post explains how to think about DRAM read granularity / counting line-fills.
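Reusing the count_straddlers sketch from above (assuming, as before, that the array starts on a cache-line boundary): over one repeat period of lcm(24, 64) = 192 bytes, 8 elements span 3 lines:

static_assert(count_straddlers(24, 64, 8) == 2);  // 2 of 8, i.e. 1 in 4 elements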
Or again, you could just do this:
struct BetterFastStruct {
    std::int64_t d;
    int b;
    char a;
    char c;
};
In Summary:

- Grouping fields by their size can be helpful for minimizing alignment overhead.
- Sprinkling individual char fields between your other 4- and 8-byte fields is a bad plan and makes my skin crawl.
- Padding the struct yourself is almost never worthwhile, and the few times you care to pad to a larger size, you should be using the alignas keyword on the struct itself, not adding junk fields (see the sketch after this list). It's hard for me to imagine a case where adding junk padding fields to the middle of a struct is a sensible decision.
- Using the "packed" attribute/pragma is something I've done on a handful of occasions, when I cared more about compressing an array of O(N) structs into fewer bytes than about the increased cost of accessing those structs due to misalignment sometimes causing them to straddle cache lines.
- I don't always pack my structs, but when I do, I ALWAYS use the attribute syntax.
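A minimal sketch of the alignas approach mentioned in the list above (the struct name is mine; 64 is chosen to match a cache line):

#include <cstdint>

struct alignas(64) CacheLineSized {  // tail-padded by the compiler, no junk fields
    std::int64_t d;
    int b;
    char a;
    char c;
};

static_assert(sizeof(CacheLineSized) == 64);   // padded up to its alignment
static_assert(alignof(CacheLineSized) == 64);  // never straddles a cache line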
Upvotes: 0
Reputation: 965
The examples given in the book are highly dependent on the compiler and computer architecture used. If you test them in your own program, you may get totally different results than the author did. I will assume a 64-bit architecture, because from what I've read in the description, the author does too. Let's look at the examples one by one:
ReallySlowStruct: If the compiler supports non-byte-aligned struct members, the start of "d" will be at the seventh bit of the first byte of the struct. Sounds very good for saving memory. The problem is that C does not allow bit-addressing, so to save newValue to the "d" member, the compiler must do a whole lot of bit-shifting: store the low two bits of "newValue" in the top two bits of byte 0 (shifted left by 6 bits), then shift "newValue" right by two bits and store the remainder starting at byte 1. Byte 1 is a non-aligned memory location, which means the bulk memory-transfer instructions won't work; the compiler must store one byte at a time.
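A sketch of that byte-at-a-time shifting (little-endian bitfield layout assumed; the function and newValue are illustrative, not the book's code):

#include <cstdint>

// Store a 64-bit value starting at bit 6 of a packed buffer, the way a
// compiler must for ReallySlowStruct::d.
void store_d_at_bit6(unsigned char* bytes, std::uint64_t newValue) {
    // Low 2 bits of newValue go into the top 2 bits of byte 0.
    bytes[0] = static_cast<unsigned char>(
        (bytes[0] & 0x3F) | static_cast<unsigned char>(newValue << 6));
    // The remaining 62 bits, shifted right by 2, fill bytes 1..7.
    std::uint64_t rest = newValue >> 2;   // d's bits 2..63
    for (int i = 0; i < 7; ++i)
        bytes[1 + i] = static_cast<unsigned char>(rest >> (8 * i));
    // Byte 8 holds d's top 6 bits in its low 6 bits; its top 2 bits belong
    // to the next bitfield and must be preserved.
    bytes[8] = static_cast<unsigned char>(
        (bytes[8] & 0xC0) | ((rest >> 56) & 0x3F));
}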
SlowStruct: It gets better. The compiler can get rid of all the bit-fiddling, but writing "d" will still require writing one byte at a time, because it is not aligned to the native word size. The native size on a 64-bit system is 8, so every memory address not divisible by 8 can only be accessed one byte at a time. Worse, if I switch packing off, I waste a lot of memory: every member followed by a wider member is padded with enough bytes to let the wider one start at a memory location divisible by 8. In this case, char c is padded out to 8 bytes (1 byte of data plus 7 of padding) so that d starts on an 8-byte boundary, and int b plus char a are padded out to 8 bytes at the end, for a total of 24.
FastStruct: This is aligned to the native word size of the target machine. "d" takes up 8 bytes, as it should. Because the chars are all bundled in one place, the compiler does not pad them and does not waste space; chars are only 1 byte each, so they need no padding. The complete structure adds up to an overall size of 16 bytes: divisible by 8, so no padding is needed.
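These sizes can be checked directly (assuming a typical 64-bit compiler, where __int64 corresponds to std::int64_t):

#include <cstdint>

struct FastStruct {
    std::int64_t d;    // offset 0, naturally aligned
    int b;             // offset 8
    char a;            // offset 12
    char c;            // offset 13
    char unused[2];    // offsets 14-15: manual tail padding
};

static_assert(sizeof(FastStruct) == 16);  // divisible by 8: no compiler padding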
In most scenarios, you never have to be concerned with alignment, because the default alignment is already optimal. In some cases, however, you can achieve significant performance improvements or memory savings by specifying a custom alignment for your data structures.
In terms of memory space, the compiler pads the structure in a way that naturally aligns each element of the structure.
struct x_
{
    char a;   // 1 byte
    int b;    // 4 bytes
    short c;  // 2 bytes
    char d;   // 1 byte
} bar[3];
struct x_ is padded by the compiler and thus becomes:
// Shows the actual memory layout
struct x_
{
    char a;         // 1 byte
    char _pad0[3];  // padding to put 'b' on a 4-byte boundary
    int b;          // 4 bytes
    short c;        // 2 bytes
    char d;         // 1 byte
    char _pad1[1];  // padding to make sizeof(x_) a multiple of 4
} bar[3];
Source: https://learn.microsoft.com/en-us/cpp/cpp/alignment-cpp-declarations?view=vs-2019
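A self-contained check of that layout (assuming a typical ABI where int needs 4-byte alignment):

#include <cstddef>

struct x_ {
    char a;
    int b;
    short c;
    char d;
};

static_assert(sizeof(x_) == 12);      // 1 + 3 pad + 4 + 2 + 1 + 1 pad
static_assert(offsetof(x_, b) == 4);  // b pushed to the next 4-byte boundary
static_assert(alignof(x_) == 4);      // alignment of the widest member (int)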
Upvotes: 47