Reputation: 355
I have a binary file made from C structs that I want to parse in Python. I know the exact format and layout of the binary, but I am confused about how to use Python's struct module to unpack this data.
Would I have to traverse the whole binary unpacking a certain number of bytes at a time based on what the members of the struct are?
C File Format:
typedef struct {
    int data1;
    int data2;
    int data4;
} datanums;

typedef struct {
    datanums numbers;
    char *name;
} personal_data;
Let's say the binary file contains personal_data structs one after another.
Upvotes: 7
Views: 13018
Reputation: 365587
Assuming the layout is a static binary structure that can be described by a simple struct pattern, and the file is just that structure repeated over and over again, then yes, "traverse the whole binary unpacking a certain number of bytes at a time" is exactly what you'd do.
For example:
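import struct

record = struct.Struct('>HB10cL')

def read_records(path='myfile.bin'):
    with open(path, 'rb') as f:
        while True:
            buf = f.read(record.size)
            if len(buf) < record.size:
                break  # end of file, or a truncated final record
            yield record.unpack(buf)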
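To see where the record size comes from, here is a small sketch using the same example format. The sample values are made up for illustration. It also shows a side effect of `10c` that is easy to miss: it yields ten separate one-byte values, whereas `10s` would yield a single bytes object.

```python
import struct

record = struct.Struct('>HB10cL')  # same example format as above

# '>' selects big-endian with standard sizes and no padding:
# H = 2 bytes, B = 1 byte, 10c = ten 1-byte chars, L = 4 bytes.
expected_size = 2 + 1 + 10 + 4     # 17 bytes per record

# Made-up sample record, packed with '10s' so the name is one bytes object.
raw = struct.pack('>HB10sL', 7, 1, b'abcdefghij', 42)

fields = record.unpack(raw)               # 13 values: ten of them are 1-byte chars
compact = struct.unpack('>HB10sL', raw)   # 4 values: the name stays in one piece
```

If the ten characters are really one fixed-width string, `10s` is probably the more convenient choice.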
If you're worried about the efficiency of only reading 17 bytes at a time and you want to wrap that up by buffering 8K at a time or something… well, first make sure it's an actual problem worth optimizing; then, if it is, loop over unpack_from instead of unpack. Something like this (untested, top-of-my-head code):
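def read_records_buffered(path='myfile.bin'):
    buf, offset = b'', 0
    with open(path, 'rb') as f:
        while True:
            if len(buf) - offset < record.size:
                # Keep the unread tail and top it up with another 8K.
                buf, offset = buf[offset:] + f.read(8192), 0
                if len(buf) < record.size:
                    break  # end of file, or a truncated final record
            yield record.unpack_from(buf, offset)
            offset += record.size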
Or, even simpler, as long as the file isn't too big for your vmsize, just mmap the whole thing and unpack_from on the mmap itself:
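import mmap

def read_records_mmap(path='myfile.bin'):
    with open(path, 'rb') as f:
        # mmap.mmap takes a file descriptor, not a file object.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            for offset in range(0, m.size() - record.size + 1, record.size):
                yield record.unpack_from(m, offset)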
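For the structs in the question specifically, a format string might look like the sketch below. It rests on an assumption the question does not confirm: that the file was produced by writing personal_data structs directly from memory on the same platform, so native size and alignment apply. In that case the `char *name` member is stored only as a pointer value, which is meaningless outside the writing process, so the name itself cannot be recovered from such a file. The names `PERSONAL_DATA` and `parse_personal_data` are made up for illustration.

```python
import struct

# Assumption: native layout ('@'), i.e. the structs were dumped from memory
# by a C program built for the same platform as this Python interpreter.
# 'iii' covers datanums; 'P' is the raw char* value (an address, not the text).
PERSONAL_DATA = struct.Struct('@iiiP')

def parse_personal_data(blob):
    """Return a list of (data1, data2, data4) tuples from raw bytes."""
    # iter_unpack needs a whole number of records, so drop any partial tail.
    usable = len(blob) - len(blob) % PERSONAL_DATA.size
    return [fields[:3] for fields in PERSONAL_DATA.iter_unpack(blob[:usable])]
```

If the writer instead stored the name as a fixed-size char array or a length-prefixed string, the format would need `Ns` or a two-step read in place of `P`.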
Upvotes: 5
Reputation: 9890
You can unpack a few at a time. Let's start with this example:
In [44]: a = struct.pack("iiii", 1, 2, 3, 4)
In [45]: a
Out[45]: '\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00'
If you're using a string (a bytes object in Python 3), you can just unpack a slice of it, or use unpack_from:
In [49]: struct.unpack("ii",a[0:8])
Out[49]: (1, 2)
In [55]: struct.unpack_from("ii",a,0)
Out[55]: (1, 2)
In [56]: struct.unpack_from("ii",a,4)
Out[56]: (2, 3)
If you're using a buffer, you'll need to use unpack_from.
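A minimal sketch of the buffer case, assuming Python 3, with a memoryview over a bytearray standing in for the buffer. Stepping the offset by the size of the format walks through the data one pair at a time without copying slices.

```python
import struct

# Same four ints as in the session above, held in a mutable buffer.
data = bytearray(struct.pack("iiii", 1, 2, 3, 4))
view = memoryview(data)

pair_size = struct.calcsize("ii")  # bytes consumed by one "ii" unpack

# Unpack consecutive, non-overlapping pairs straight from the buffer.
pairs = [struct.unpack_from("ii", view, offset)
         for offset in range(0, len(view), pair_size)]
```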
Upvotes: 2