Reputation: 39
I have a file that I want to search for a specific hex value (the header). Once it is found, I want to read everything from that location until a specific hex value (the footer) is found.
I have some starting code:
import binascii
holdhd = ""
holdft = ""
header = "03AABBCC"
footer = "FF00FFAA"
with open ('hexfile', 'rb') as file:
    bytes = file.read()
    a = binascii.hexlify(bytes)
    while header in a:
        holdhd = header
        print holdhd
This successfully prints the header I wish to find (there are multiple headers in the file), but I am unsure how to proceed with reading the file from that point and printing everything until the footer is found.
Thanks in advance
Upvotes: 1
Views: 15822
Reputation: 18418
If the files are small enough to load into memory, you can treat them as regular strings and use the str.find method to navigate them.
Let's take the worst-case scenario: you have no guarantee that the header will be the first thing in the file, and you might have more than one body (more than one <header><body><footer> block). I created a file called bindata.txt with the following content:
ABCD000100a0AAAAAA000000000000ABCDABCD000100a0BBBBBB000000000000ABCD
OK, there are two bodies, the first being AAAAAA and the second BBBBBB, plus some junk at the beginning, middle and end (ABCD before the first header, ABCDABCD before the second header and ABCD after the second footer).
Playing with the find method of the str object and the indexes, here's what I came up with:
header = "000100a0"
footer = "00000000000"
with open('bindata.txt', 'r') as f:
    data = f.read()
    print "Data: %s" % data
    header_index = data.find(header, 0)
    footer_index = data.find(footer, 0)
    if header_index >= 0 and footer_index >= header_index:
        print "Found header at %s and footer at %s" \
            % (header_index, footer_index)
        body = data[header_index + len(header): footer_index]
        while body is not None:
            print "body: %s" % body
            # Look for the next header/footer pair after the current footer
            header_index = data.find(header,
                                     footer_index + len(footer))
            footer_index = data.find(footer,
                                     footer_index + len(footer) + len(header))
            if header_index >= 0 and footer_index >= header_index:
                print "Found header at %s and footer at %s" \
                    % (header_index, footer_index)
                body = data[header_index + len(header): footer_index]
            else:
                body = None
That outputs:
Data: ABCD000100a0AAAAAA000000000000ABCDABCD000100a0BBBBBB000000000000ABCD
Found header at 4 and footer at 18
body: AAAAAA
Found header at 38 and footer at 52
body: BBBBBB
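The same find-based walk can be folded into a small helper that collects every body. This is only a sketch, written for Python 3 (the code above is Python 2), and the name extract_bodies is made up for illustration. It assumes each footer belongs to the nearest preceding header, and it works on str or bytes since both provide find and slicing.

```python
def extract_bodies(data, header, footer):
    """Return every chunk of `data` lying between `header` and the next `footer`."""
    bodies = []
    pos = 0
    while True:
        start = data.find(header, pos)
        if start < 0:
            break                        # no more headers
        start += len(header)             # body begins right after the header
        end = data.find(footer, start)   # first footer after that header
        if end < 0:
            break                        # header without a closing footer
        bodies.append(data[start:end])
        pos = end + len(footer)          # resume the search after the footer
    return bodies


data = "ABCD000100a0AAAAAA000000000000ABCDABCD000100a0BBBBBB000000000000ABCD"
print(extract_bodies(data, "000100a0", "00000000000"))
```

Searching for the footer only from the end of the header onwards avoids matching a footer that sits before the header.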
If your files are too big to keep in memory, I think the best approach is to read the file byte by byte and write a couple of functions that find where the header ends and where the footer starts, using the file's seek and tell methods.
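As a rough illustration of the large-file case, here is a Python 3 sketch that reads fixed-size chunks rather than single bytes (a swapped-in variation, not the byte-by-byte seek/tell approach described above). The name iter_bodies and the chunk_size parameter are hypothetical. It assumes a non-empty header and footer, and it still buffers one whole body in memory until its footer arrives.

```python
import binascii
import io


def iter_bodies(fileobj, header, footer, chunk_size=64 * 1024):
    """Yield the bytes between each header and the following footer."""
    buf = b""
    while True:
        chunk = fileobj.read(chunk_size)
        buf += chunk
        while True:
            start = buf.find(header)
            if start < 0:
                # No header yet: keep only a tail that might hold a partial header.
                keep = len(header) - 1
                buf = buf[-keep:] if keep > 0 else b""
                break
            body_start = start + len(header)
            end = buf.find(footer, body_start)
            if end < 0:
                # Header seen but footer not read yet: drop the junk before it.
                buf = buf[start:]
                break
            yield buf[body_start:end]
            buf = buf[end + len(footer):]
        if not chunk:
            break  # end of file


header = binascii.unhexlify("000100a0")
footer = binascii.unhexlify("0000000000")
sample = binascii.unhexlify(
    "ABCD000100a0AAAAAA000000000000ABCDABCD000100a0BBBBBB000000000000ABCD")
# A tiny chunk size forces headers and footers to straddle chunk boundaries.
bodies = list(iter_bodies(io.BytesIO(sample), header, footer, chunk_size=3))
print([binascii.hexlify(b) for b in bodies])
```

In real use the first argument would be a file opened with 'rb' instead of the io.BytesIO stand-in.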
EDIT:
As per the OP's request, here is a method that works on the raw binary data (no hexlify) and uses seek and tell:
import os
import binascii
import mmap
header = binascii.unhexlify("000100a0")
footer = binascii.unhexlify("0000000000")
sample = binascii.unhexlify("ABCD"
                            "000100a0AAAAAA000000000000"
                            "ABCDABCD"
                            "000100a0BBBBBB000000000000"
                            "ABCD")
# Create the sample file:
with open("sample.data", "wb") as f:
    f.write(sample)
# sample done. Now we have a REAL binary data in sample.data
with open('sample.data', 'rb') as f:
    print "Data: %s" % binascii.hexlify(f.read())
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    current_offset = 0
    header_index = mm.find(header, current_offset)
    footer_index = mm.find(footer, current_offset + len(header))
    if header_index >= 0 and footer_index > header_index:
        print "Found header at %s and footer at %s"\
            % (header_index, footer_index)
        mm.seek(header_index + len(header))
        body = mm.read(footer_index - mm.tell())
        while body is not None:
            print "body: %s" % binascii.hexlify(body)
            current_offset = mm.tell()
            header_index = mm.find(header, current_offset + len(footer))
            footer_index = mm.find(footer, current_offset + len(footer) + len(header))
            if header_index >= 0 and footer_index > header_index:
                print "Found header at %s and footer at %s"\
                    % (header_index, footer_index)
                mm.seek(header_index + len(header))
                body = mm.read(footer_index - mm.tell())
            else:
                body = None
This method produces the following output:
Data: abcd000100a0aaaaaa000000000000abcdabcd000100a0bbbbbb000000000000abcd
Found header at 2 and footer at 9
body: aaaaaa
Found header at 19 and footer at 26
body: bbbbbb
Note that I used Python's mmap module to help move through the file. Please take a look at its documentation. Also, the first part of this example contains some code to create an actual binary file in sample.data. Executing this chunk:
# Create the sample file:
with open("sample.data", "wb") as f:
    f.write(sample)
Produces the following (really human-readable) file:
borrajax@borrajax:~/Documents/Tests$ cat ./sample.data
�������ͫ�������
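For anyone on Python 3, the mmap search might look roughly like the sketch below. It is an illustration under assumptions, not a drop-in replacement: the helper name mmap_bodies is invented, access=mmap.ACCESS_READ is used in place of prot=mmap.PROT_READ because the prot argument is Unix-only, and slicing the map replaces the seek/read pair.

```python
import binascii
import mmap
import os
import tempfile


def mmap_bodies(path, header, footer):
    """Return the bytes between each header and the next footer in a file."""
    bodies = []
    with open(path, "rb") as f:
        # ACCESS_READ is accepted on both Unix and Windows.
        # Note: mmap raises ValueError when asked to map an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = 0
            while True:
                start = mm.find(header, pos)
                if start < 0:
                    break
                start += len(header)
                end = mm.find(footer, start)
                if end < 0:
                    break
                bodies.append(mm[start:end])  # slicing copies only the body
                pos = end + len(footer)
    return bodies


sample = binascii.unhexlify(
    "ABCD000100a0AAAAAA000000000000ABCDABCD000100a0BBBBBB000000000000ABCD")
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.data")
    with open(path, "wb") as f:
        f.write(sample)
    found = mmap_bodies(path,
                        binascii.unhexlify("000100a0"),
                        binascii.unhexlify("0000000000"))
print([binascii.hexlify(b) for b in found])
```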
Upvotes: 1
Reputation: 142136
Given the file size, you might want to load everything into memory (keeping the data as bytes), then use a regex to extract the first part found between header and footer, e.g.:
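import binascii
import re
header = binascii.unhexlify('000100a0')
# unhexlify needs an even number of hex digits (the 11-digit footer raised an error)
footer = binascii.unhexlify('0000000000')
with open('hexfile', 'rb') as fin:
    raw_data = fin.read()
# re.DOTALL lets '.' also match newline bytes (0x0a) inside the body
pattern = re.escape(header) + b'(.*?)' + re.escape(footer)
data = re.search(pattern, raw_data, re.DOTALL).group(1)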
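Since the file may contain several header/footer blocks, the regex idea can be extended with finditer to collect all of them. The following is a sketch for Python 3, using the sample bytes and the 10-digit footer from the other answer as assumed test data. The re.DOTALL flag is there so that '.' also matches a newline byte (0x0a) inside a body.

```python
import binascii
import re

header = binascii.unhexlify("000100a0")
footer = binascii.unhexlify("0000000000")

# Non-greedy group: each match stops at the first footer after its header.
pattern = re.compile(re.escape(header) + b"(.*?)" + re.escape(footer), re.DOTALL)

# Assumed sample data; in practice raw_data would come from open(..., 'rb').read().
raw_data = binascii.unhexlify(
    "ABCD000100a0AAAAAA000000000000ABCDABCD000100a0BBBBBB000000000000ABCD")

bodies = [m.group(1) for m in pattern.finditer(raw_data)]
print([binascii.hexlify(b) for b in bodies])
```

If no block is present, bodies is simply an empty list, which avoids the AttributeError that re.search(...).group(1) raises when there is no match.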
Upvotes: 2