nmuntz

Reputation: 1158

Reading binary file defined by a struct

Could somebody point me in the right direction on how to read a binary file that is defined by a C struct? It has a few #defines inside the struct, which makes me think that it will complicate things.
The structure looks something like this (although it's larger and more complicated than this):

struct Format {
    unsigned long str_totalstrings;
    unsigned long str_name;
    #define STR_ORDERED 0x2
    #define STR_ROT13 0x4
    unsigned char stuff[4];
    #define str_delimiter stuff[0]
};

I would really appreciate it if somebody could point me in the right direction on how to do this. Or is there any tutorial out there that covers this topic?

Thanks a lot in advance for your help.

Upvotes: 3

Views: 7492

Answers (5)

dpm_min

Reputation: 325

There are some bad ideas and good ideas:

It's a bad idea to:

  • Typecast a raw buffer into struct
    • There are endianness issues (little-endian vs big-endian) when parsing integers >1 byte long or floats
    • There are byte alignment issues in structures, which are very compiler-dependent. One can try to disable alignment (or enforce some manual alignment), but that's generally a bad idea too. At the very least, you'll hurt performance by making the CPU access unaligned integers: an internal RISC core would have to do 3-4 operations instead of 1 (i.e. "do part 1 in the first word", "do part 2 in the second word", "merge the result") on every access. Or worse, the compiler pragmas that control alignment will be ignored and your code will break.
    • There are no exact size guarantees for the regular int, long, short, etc. types in C/C++. You can use fixed-width types like int16_t, but these are only available on compilers that provide <stdint.h> (C99) or <cstdint> (C++11).
    • Of course, this approach breaks completely when using structures that reference other structures: one has to unroll them all manually.
  • Write parsers manually: it's much harder than it seems at first glance.
    • A good parser needs to do lots of sanity checking at every stage. It's easy to miss something. It is even easier to miss something if you don't use exceptions.
    • Using exceptions makes you prone to failure if your parsing code is not exception-safe (i.e. written so that it can be interrupted at some point without leaking memory or forgetting to finalize some objects).
    • There could be performance issues (e.g. doing lots of unbuffered I/O instead of one OS read syscall followed by parsing the buffer, or, vice versa, reading the whole thing at once where more granular, lazy reads would be applicable).
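To give a feel for what even a tiny hand-written parser involves, here is a minimal sketch that decodes the question's three fields from a byte buffer instead of typecasting it. It assumes the file stores 32-bit little-endian integers with no padding; `ParsedFormat`, `readU32LE` and `parseFormat` are hypothetical names, not part of the original format.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical in-memory form of the question's struct, using fixed-width fields.
struct ParsedFormat {
    std::uint32_t str_totalstrings;
    std::uint32_t str_name;
    unsigned char stuff[4];
};

// Decode a 32-bit little-endian integer from four bytes,
// independent of the host's byte order and alignment rules.
static std::uint32_t readU32LE(const unsigned char* p) {
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Returns false if the buffer is missing or too short; never reads past `len`.
bool parseFormat(const unsigned char* buf, std::size_t len, ParsedFormat& out) {
    const std::size_t kRecordSize = 12;  // assumed on-disk size: 4 + 4 + 4, no padding
    if (buf == nullptr || len < kRecordSize) return false;
    out.str_totalstrings = readU32LE(buf);
    out.str_name         = readU32LE(buf + 4);
    for (std::size_t i = 0; i < 4; ++i) out.stuff[i] = buf[8 + i];
    return true;
}
```

Even this small case needs an explicit length check and a byte-order decision, which is the kind of detail that is easy to get wrong in a larger format.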

It's a good idea to:

  • Go cross-platform. Pretty much self-explanatory, with all the mobile devices, routers and IoT stuff booming in recent years.
  • Go declarative. Consider using a declarative spec to describe your structure and then use a parser generator to generate a parser from it.

There are several tools available to do that:

  • Kaitai Struct: my favorite so far, cross-platform and cross-language, i.e. you describe your structure once and then compile it into a parser in C++, C#, Java, Python, Ruby, PHP, etc.
  • binpac: pretty dated, but still usable, C++-only; similar to Kaitai in ideology, but unsupported since 2013.
  • Spicy: said to be a "modern rewrite" of binpac, AKA "binpac++", but still in early stages of development; can be used for smaller tasks, C++-only too.

Upvotes: 8

bradtgmurray

Reputation: 14313

You can also use unions to do this parsing if you have the data you want to parse already in memory.

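#include <cstring>  // std::memcpy

union A {
    unsigned char buffer[sizeof(Format)];  // raw bytes, same size as the struct
    Format format;
};

A a;
// Copy the raw bytes in; a char* member would only overlay the pointer, not the data.
std::memcpy(a.buffer, stuff_you_want_to_parse, sizeof a.buffer);

// You can now access the members of the struct through the union.
// (Well-defined in C; formally undefined in C++, though GCC documents it as an extension.)
if (a.format.str_totalstrings > 0) {
    // do stuff
}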

Also remember that long can have different sizes on different platforms. If you are depending on long being a certain size, consider using the types defined in stdint.h, such as uint32_t.

Upvotes: 2

AShelly

Reputation: 35600

Reading a binary defined by a struct is easy.

Format myFormat;
fread(&myFormat, sizeof(Format), 1, fp);

The #defines don't affect the structure at all. (Inside the struct is an odd place to put them, though.)

However, this is not cross-platform safe. It is the simplest thing that will possibly work, in situations where you are assured the reader and writer are using the same platform.

The better way would be to re-define your structure as such:

struct Format {
    Uint32 str_totalstrings;  //assuming unsigned long was 32 bits on the writer.
    Uint32 str_name;
    unsigned char stuff[4];
};

and then have a "platform_types.h" which typedefs Uint32 correctly for your compiler. Now you can read directly into the structure, but for endianness issues you still need to do something like this:

myFormat.str_totalstrings = FileToNative32(myFormat.str_totalstrings);
myFormat.str_name = FileToNative32(myFormat.str_name);

where FileToNative32 is either a no-op or a byte reverser depending on platform.
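One possible shape for FileToNative32 is sketched below. It assumes the file stores integers in little-endian order (swap the branches for a big-endian file) and uses std::uint32_t in place of the Uint32 typedef; `hostIsLittleEndian` and `byteSwap32` are hypothetical helper names.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: true when the host stores the least significant byte first.
inline bool hostIsLittleEndian() {
    const std::uint16_t probe = 1;
    unsigned char first;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Hypothetical helper: reverse the four bytes of a 32-bit value.
inline std::uint32_t byteSwap32(std::uint32_t v) {
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

// Assumption: the file is little-endian, so a little-endian host needs no work
// and a big-endian host needs the bytes reversed.
inline std::uint32_t FileToNative32(std::uint32_t raw) {
    return hostIsLittleEndian() ? raw : byteSwap32(raw);
}
```

The runtime check keeps the sketch portable; a real project might instead pick the branch at compile time in its platform header.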

Upvotes: 4

Ferruccio

Reputation: 100748

Using C++ I/O library:

#include <fstream>
using namespace std;

ifstream ifs("file.dat", ios::binary);
Format f;
ifs.read(reinterpret_cast<char*>(&f), sizeof f);  // read() copies raw bytes; get() would stop at a newline

Using C I/O library:

#include <cstdio>
using namespace std;

FILE *fin = fopen("file.dat", "rb");
Format f;
fread(&f, sizeof f, 1, fin);
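Neither snippet checks whether the read succeeded. A minimal sketch of a checked read with the C++ I/O library follows; `readRecord` is a hypothetical helper name, and it assumes the record sits at the very start of the file.

```cpp
#include <fstream>
#include <type_traits>

// Hypothetical helper: reads exactly sizeof(T) bytes from the start of `path`.
// Returns false if the file cannot be opened or is shorter than the record.
template <typename T>
bool readRecord(const char* path, T& out) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "raw reads only make sense for trivially copyable types");
    std::ifstream ifs(path, std::ios::binary);
    if (!ifs) return false;  // could not open the file
    ifs.read(reinterpret_cast<char*>(&out), sizeof out);
    // gcount() reports how many bytes the last read actually delivered.
    return ifs.gcount() == static_cast<std::streamsize>(sizeof out);
}
```

With the struct from the question this would be called as `Format f; if (!readRecord("file.dat", f)) { /* handle error */ }`.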

Upvotes: 2

Nikolai Fetissov

Reputation: 84239

You have to find out the endianness of the machine where the file was written so you can interpret integers properly. Look out for an ILP32 vs LP64 mismatch (unsigned long is 32 bits under ILP32 but 64 bits under LP64). The original structure packing/alignment might also be important.
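One way to guard against size and packing surprises is to pin the layout down at compile time. The sketch below assumes the writer was an ILP32 machine (32-bit unsigned long) that added no padding; `FormatOnDisk` is a hypothetical name for the fixed-width version of the struct.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-width mirror of the question's struct.
// Assumption: the writer's unsigned long was 32 bits wide.
struct FormatOnDisk {
    std::uint32_t str_totalstrings;
    std::uint32_t str_name;
    unsigned char stuff[4];
};

// Compilation fails if this compiler lays the struct out differently
// from the assumed 12-byte, padding-free file record.
static_assert(sizeof(FormatOnDisk) == 12, "unexpected size or padding");
static_assert(offsetof(FormatOnDisk, str_totalstrings) == 0, "unexpected offset");
static_assert(offsetof(FormatOnDisk, str_name) == 4, "unexpected offset");
static_assert(offsetof(FormatOnDisk, stuff) == 8, "unexpected offset");
```

These checks cover size and packing only; byte order still has to be handled separately.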

Upvotes: 1
