Reputation: 15588
As a side project I would like to try to parse binary files (Mach-O files specifically). I know tools exist for this already (otool) so consider this a learning exercise.
The problem I'm hitting is that I don't understand how to convert the binary elements I read into a Python representation. For example, the Mach-O file format starts with a header which is defined by a C struct. The first item is a uint32_t 'magic number' field. When I do
magic = f.read(4)
I get
b'\xcf\xfa\xed\xfe'
This is starting to make sense to me. It's literally a byte array of 4 bytes. However, I want to treat this as a 4-byte int that represents the original magic number. Another example is the numberOfSections field. I just want the number represented by the 4-byte field, not an array of literal bytes.
Perhaps I'm thinking about this all wrong. Has anybody worked on anything similar? Do I need to write functions that look at these 4-byte arrays and shift and combine their values to produce the number I want? Is endianness going to screw me here? Any pointers would be most helpful.
Upvotes: 15
Views: 34724
Reputation: 1044
I would suggest the Construct module. It offers a very high-level interface.
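For the single-field case in the question, the standard library is enough on its own. The following is a minimal sketch that uses `int.from_bytes` and the `struct` module rather than Construct; the variable names are illustrative only. It also shows why endianness matters: the same four bytes give two different integers depending on the byte order chosen.

```python
import struct

magic_bytes = b"\xcf\xfa\xed\xfe"  # the four bytes shown in the question

# Interpret the bytes as an unsigned 32-bit integer in each byte order.
little = int.from_bytes(magic_bytes, byteorder="little")
big = int.from_bytes(magic_bytes, byteorder="big")

# struct.unpack does the same job: "<I" is a little-endian uint32,
# ">I" would be big-endian. It always returns a tuple.
(little_via_struct,) = struct.unpack("<I", magic_bytes)

print(hex(little))  # 0xfeedfacf
print(hex(big))     # 0xcffaedfe
```

Read as little-endian, the bytes give 0xfeedfacf, the magic number of a 64-bit Mach-O file, so no manual shifting and combining is needed.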
Upvotes: 6
Reputation: 203
There's the Kaitai Struct project, which solves exactly that problem. First, you describe a file format using a .ksy spec, then you compile it into a Python library (or a library in any other major programming language), import it, and, voilà, parsing boils down to:
from mach_o import MachO
my_file = MachO.from_file("/path/to/your/file")
my_file.magic # => 0xfeedface
my_file.num_of_sections # => some other integer
my_file.sections # => list of objects that represent sections
They have a growing repository of file format specs. It doesn't have a Mach-O spec (yet?), but complex formats like Java .class or Microsoft's PE executable are described there, so I guess it shouldn't be a major problem to write a spec for the Mach-O format as well.
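Until such a spec exists, the fixed-size Mach-O header can also be read by hand. Below is a rough sketch using only the standard library's `struct` module; it is not Kaitai Struct output. The field layout is assumed to follow `mach_header_64` from Apple's `<mach-o/loader.h>`, the helper name `parse_mach_header` is hypothetical, and the sample header values are made up for illustration. 32-bit and universal (fat) binaries are not handled.

```python
import struct

MH_MAGIC_64 = 0xFEEDFACF  # 64-bit Mach-O, bytes read in the file's own order
MH_CIGAM_64 = 0xCFFAEDFE  # same magic seen with the byte order swapped

# Field names assumed from struct mach_header_64 (eight 4-byte fields).
HEADER_FIELDS = ("magic", "cputype", "cpusubtype", "filetype",
                 "ncmds", "sizeofcmds", "flags", "reserved")


def parse_mach_header(data):
    """Parse a 64-bit Mach-O header from the first 32 bytes of data."""
    # Read the magic as little-endian first to discover the file's byte order.
    (magic,) = struct.unpack_from("<I", data, 0)
    if magic == MH_MAGIC_64:
        order = "<"  # file is little-endian
    elif magic == MH_CIGAM_64:
        order = ">"  # file is big-endian
    else:
        raise ValueError(f"not a 64-bit Mach-O magic: {magic:#x}")
    # cputype and cpusubtype are signed ints; the rest are unsigned.
    values = struct.unpack_from(order + "IiiIIIII", data, 0)
    return dict(zip(HEADER_FIELDS, values))


# Made-up header bytes for demonstration, not taken from a real binary.
sample = struct.pack("<IiiIIIII", MH_MAGIC_64, 0x01000007, 3, 2, 16, 1024, 0, 0)
header = parse_mach_header(sample)
print(hex(header["magic"]), header["ncmds"])
```

With a real file, `data` would come from something like `f.read(32)` on a file opened in binary mode.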
Kaitai Struct is actually better than Construct or Hachoir, because it's compiled (as opposed to interpreted) and thus faster, and it includes tons of other helpful tools like a visualizer and a format diagram maker. For example, it can generate an explanation diagram for the PE executable format.
Upvotes: 14