user2285060

Reputation: 123

How much does "packing" structures impact performance?

It isn't my goal to start micro-optimizing, so if that's what this turns into, I'll gladly drop the question. But I'm about to start making some design decisions and want to be more informed.

I am reading and processing a file format which contains numerous data structures that are documented in a well-defined format. I've represented them in code as structs.

Now, if I pack the structs to 1-byte alignment with #pragma pack(1), I can read the structures off the IO stream directly onto struct pointers. This is convenient. If I don't pack the structures, I can either fread the fields one by one, or fread blocks at a time and reinterpret_cast the struct fields one by one, which will probably get old fast.
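For concreteness, here is a minimal sketch of the packed-read approach; the Record struct and its field names are made-up stand-ins for whatever the file format actually defines:

```cpp
#include <cstdint>
#include <cstdio>

#pragma pack(push, 1)
struct Record {
    std::uint16_t id;
    std::uint16_t flags;
    std::uint32_t offset;
    std::uint64_t timestamp;
};
#pragma pack(pop)

bool read_record(std::FILE* f, Record& out) {
    // With 1-byte packing, sizeof(Record) matches the on-disk layout exactly,
    // so a single fread fills the whole struct.
    return std::fread(&out, sizeof(Record), 1, f) == 1;
}
```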

For reference, the structs will be read (potentially) by the thousands and could have some number crunching done on them. They're mostly composed of unsigned 16-bit integers (about 60%), unsigned 32-bit integers (about 30%) and some 64-bit integers.

So the question at hand is, do I...

Upvotes: 5

Views: 1673

Answers (3)

Mark Ransom

Reputation: 308140

The code will be clearest if you just use packing and read into the structs directly. That's also likely to be the fastest way to read. Unfortunately it can also be a source of bugs, especially if the layout of the structure changes in the future.

Alignment of the elements can be an issue, or it might not be, depending on many factors. If the elements are sorted by size with the largest ones first, alignment isn't likely to be a problem. If the source produced the stream of bytes by directly writing the entire structure, it was also likely properly aligned for that system and might work perfectly well on your end. The x86 architecture can handle misalignment pretty well, with only a minor slowdown worst case; even that's minimized by the cache structure where an entire cache line is loaded at once, guaranteeing most of the bytes will already be in cache. Other architectures may not handle misalignment at all, but you'll know pretty quickly if that happens.

If you need different endianness than the source, you can call a function on each element of the structure to fix them individually. At that point the simplicity and clarity of the direct read will be diminished and you might be better off with the other method.
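For example, a per-field fix-up might look roughly like this; the struct and the byte-swap helpers are illustrative only, not taken from the actual file format in the question:

```cpp
#include <cstdint>

#pragma pack(push, 1)
struct Record {
    std::uint16_t id;
    std::uint16_t flags;
    std::uint32_t offset;
};
#pragma pack(pop)

static std::uint16_t swap16(std::uint16_t v) {
    return static_cast<std::uint16_t>((v >> 8) | (v << 8));
}
static std::uint32_t swap32(std::uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

// Called once per record after the raw read, and only when the host byte
// order differs from the file's. Every field needs its own call, which is
// what erodes the simplicity of the direct read.
void fix_endianness(Record& r) {
    r.id     = swap16(r.id);
    r.flags  = swap16(r.flags);
    r.offset = swap32(r.offset);
}
```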

Upvotes: 1

Mats Petersson

Reputation: 129344

Ultimately, the performance difference between solution A and solution B can only be determined by benchmarking. Asking on the internet will give you variable results that may or may not reflect the reality of your case.

What happens when you "misalign" data is that the processor needs to do multiple reads [and the same applies for writes] for one piece of data. Exactly how much extra time that takes depends on the processor - some processors don't handle it automatically, so the runtime system will trap the "bad read" and perform the read in some emulation layer [or, on some processors, simply kill the process for "unaligned memory access"]. Clearly, taking a trap, doing a couple of read operations and then returning to the calling code has a pretty significant impact on performance - it can easily take hundreds of cycles longer than an aligned read operation.

In the case of x86, it "works just like you'd expect", but with a penalty of typically one extra clock cycle [assuming the data is already in the L1 cache]. One clock cycle isn't very much on a modern processor, but if the loop runs 10000000000000 iterations and reads unaligned data n times per iteration, you have now added n * 10000000000000 clock cycles to the execution time, which may be significant.

The other alternatives also have an impact on performance. Doing a lot of small reads is likely A LOT slower than doing one large read. A conversion function is LIKELY better from a performance perspective.
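As a hedged illustration of the "one large read plus a conversion function" idea - the record layout here is invented and the on-disk size is assumed to be 6 bytes:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

struct Record {               // naturally aligned, used by the rest of the program
    std::uint16_t id;
    std::uint32_t offset;
};

constexpr std::size_t kOnDiskSize = 6;   // 2 + 4 bytes per record in the file

std::vector<Record> read_block(std::FILE* f, std::size_t count) {
    // One large read into a raw buffer...
    std::vector<unsigned char> buf(count * kOnDiskSize);
    std::size_t got = std::fread(buf.data(), kOnDiskSize, count, f);

    // ...then a cheap in-memory conversion into aligned structs.
    std::vector<Record> out(got);
    for (std::size_t i = 0; i < got; ++i) {
        const unsigned char* p = buf.data() + i * kOnDiskSize;
        // memcpy avoids the aliasing/alignment pitfalls of reinterpret_cast.
        std::memcpy(&out[i].id,     p,     sizeof out[i].id);
        std::memcpy(&out[i].offset, p + 2, sizeof out[i].offset);
    }
    return out;
}
```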

Again, please don't take this as a "given"; you really need to compare the different solutions (or pick one, and if the performance doesn't suck and the code isn't horrible looking, leave it at that). I'm fairly convinced you could find cases where every one of the three solutions you suggest is "best".

Also bear in mind that #pragma pack is compiler specific, and it's not easy to write macros that select between the "Microsoft" and "gcc" solutions, for example. Edit: it would appear that more recent gcc versions do support this option - but not ALL compilers do.
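One illustrative way to paper over the difference - a sketch only, assuming just MSVC and GCC/Clang need to be supported:

```cpp
#include <cstdint>

#if defined(_MSC_VER)
  // MSVC: bracket the definition with push/pop of the packing pragma.
  #define PACK_BEGIN  __pragma(pack(push, 1))
  #define PACK_END    __pragma(pack(pop))
  #define PACKED_ATTR
#elif defined(__GNUC__) || defined(__clang__)
  // GCC/Clang: attach the packed attribute to the struct itself.
  #define PACK_BEGIN
  #define PACK_END
  #define PACKED_ATTR __attribute__((packed))
#else
  #error "No packing mechanism known for this compiler"
#endif

PACK_BEGIN
struct FileHeader {
    std::uint16_t version;
    std::uint32_t record_count;
} PACKED_ATTR;
PACK_END
```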

Upvotes: 5

Mark B

Reputation: 96241

Per your comment to another answer, your code intends to be platform agnostic and the endianness of the file format is clearly specified. In this case, reading directly into a packed struct loses much of its clarity, because it will require an after-read endian-cleanup step or else produce incorrect data on architectures whose endianness differs from the file format's.

Assuming that you always know the number of bytes (probably from a struct type indicator in the file), I would suggest using a factory pattern where the created object's constructor knows how to pull bytes out of a memory buffer attribute by attribute. (If the file is small enough, you can just read the entire thing into a buffer and then do a loop/factory-create/deserialize-into-object-via-constructor.) This way you can control the endianness and keep the compiler's preferred struct alignment.
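A rough sketch of that factory idea might look like the following; the RecordA type, its layout, the 2-byte type tag and the helper names are all assumptions for illustration, not part of the actual file format:

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdexcept>

// Little-endian readers that work regardless of the host's byte order.
static std::uint16_t read_u16(const unsigned char* p) {
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}
static std::uint32_t read_u32(const unsigned char* p) {
    return p[0] | (p[1] << 8) | (p[2] << 16) |
           (static_cast<std::uint32_t>(p[3]) << 24);
}

struct RecordA {
    std::uint16_t id;
    std::uint32_t offset;
    static constexpr std::size_t kSize = 6;   // on-disk size, excluding the tag

    // The constructor deserializes attribute by attribute from the buffer.
    explicit RecordA(const unsigned char* p)
        : id(read_u16(p)), offset(read_u32(p + 2)) {}
};

// Factory: inspect the type tag, build the matching record from the buffer,
// and advance the cursor past the bytes that were consumed.
std::unique_ptr<RecordA> make_record(const unsigned char*& cursor) {
    std::uint16_t type = read_u16(cursor);
    cursor += 2;
    if (type == 1) {                          // hypothetical tag for RecordA
        auto rec = std::make_unique<RecordA>(cursor);
        cursor += RecordA::kSize;
        return rec;
    }
    throw std::runtime_error("unknown record type");
}
```

Other record types would be handled the same way, typically dispatched through a common base class returned by the factory.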

Upvotes: 2
