Reputation: 269
I'm programming a small script that is meant to open a binary file, find an often-changing binary blob, and copy just that blob to a new file.
Here's the layout of the binary file:
-JUNK (unknown size, unknown contents)
-3-byte HEADER containing the encoded size of the blob
-PADDING (unknown size, every byte is FF in hex)
-Start of blob (72 bytes, unknown contents)
-16 bytes that are ALWAYS the same
-End of blob (size = HEADER value minus (72+16), unknown contents)
-JUNK (unknown size, unknown contents)
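To be concrete, my script treats the HEADER as a big-endian integer holding the full blob size. A quick sketch of the arithmetic (header bytes invented for illustration):

import struct

header = b"\x00\x01\xF4"                              # hypothetical header: 500 decimal
blob_size = struct.unpack(">I", b"\x00" + header)[0]  # pad to 4 bytes, big-endian
tail_size = blob_size - (72 + 16)                     # end-of-blob chunk: 500 - 88 = 412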
Here's the code I've written so far:
from sys import argv
import binascii
import base64

InputFileName = argv[1]

with open(InputFileName, 'rb') as InputFile:
    # Find the constant 16-byte marker that sits 72 bytes into the blob.
    Constant16 = base64.b64decode("GIhTSuBask6y60iLI2VwIg==")
    Constant16Offset = InputFile.read().find(Constant16)

    # Jump to the start of the blob, 72 bytes before the marker.
    InputFile.seek(Constant16Offset)
    InputFile.seek(-72, 1)

    # Walk backwards one byte at a time over the FF padding.
    InputFile.seek(-1, 1)
    FFTestVar = InputFile.read(1)
    while FFTestVar == b'\xFF':
        InputFile.seek(-2, 1)
        FFTestVar = InputFile.read(1)

    # The first non-FF byte found is the last byte of the 3-byte header.
    InputFile.seek(-3, 1)
    BlobSizeBin = InputFile.read(3)
    BlobSizeHex = binascii.b2a_hex(BlobSizeBin)
    BlobSizeDec = int(BlobSizeHex, 16)

    # Go back to the blob start and read the whole blob.
    InputFile.seek(Constant16Offset)
    InputFile.seek(-72, 1)
    Blob = InputFile.read(BlobSizeDec)

with open('output.bin', 'wb') as OutputFile:
    OutputFile.write(Blob)
Unfortunately, the while loop is SLOW. The input file can be up to 24 MB, and the padding can be a huge chunk of that, so going through it one byte at a time is ridiculously slow.
I'm thinking that there's probably a better way of doing this, but an hour or two of Googling hasn't been helpful.
Thanks!
Upvotes: 1
Views: 179
Reputation: 596
You can read the whole file into memory (you actually already do this):

data = InputFile.read()

Then you can treat data like an ordinary string (not a unicode string, but an array of bytes, which is unfortunately called str under Python 2.X). You need to keep track of the offset yourself, so we will create an offset variable. Every line which looks like InputFile.seek(xx) must be translated into offset = xx, and InputFile.seek(xx, 1) into offset += xx.
magic_number = base64.b64decode("GIhTSuBask6y60iLI2VwIg==")
offset = magic_number_offset = data.find(magic_number)
offset -= 72  # back from the marker to the start of the blob
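The same rule applies to reads: each InputFile.read(n) becomes a slice plus an offset bump. A sketch of the pattern:

n = 3                          # however many bytes the original read() consumed
chunk = data[offset:offset+n]  # replaces chunk = InputFile.read(n)
offset += n                    # read() also advanced the file position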
Then, instead of the while loop, use the re module (you need to import it):
pattern = re.compile("[^\xFF]\xFF*$")
# $ anchors at endpos (the blob start), so the match begins at the last
# non-FF byte before the padding run; +1 points just past that byte.
offset = pattern.search(data, endpos=offset).start() + 1
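As a sanity check, here is how that search behaves on a tiny made-up layout (every byte invented for illustration):

import re

pattern = re.compile("[^\xFF]\xFF*$")

# 4 junk bytes, a 3-byte header (value 0x60 = 96), 3 FF padding bytes, then the blob.
toy = "JUNK" + "\x00\x00\x60" + "\xFF\xFF\xFF" + "BLOBDATA"
blob_start = 10
match = pattern.search(toy, 0, blob_start)  # pos=0, endpos=blob_start
print(match.start())      # 6 -- index of the last header byte
print(match.start() + 1)  # 7 -- first byte of the padding run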
And the rest of the code is:

offset -= 3  # back to the start of the 3-byte header
blob_size_bin = data[offset:offset+3]
blob_size_hex = binascii.b2a_hex(blob_size_bin)
blob_size_dec = int(blob_size_hex, 16)
offset = magic_number_offset - 72  # back to the start of the blob
blob = data[offset:offset+blob_size_dec]
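Writing the result out is unchanged from your version:

with open('output.bin', 'wb') as OutputFile:
    OutputFile.write(blob)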
If the files are really big and the Python process consumes a lot of memory, you can use the mmap module instead of loading the whole file into memory.
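A minimal sketch of that variant, reusing the names from above (the mmap docs note that mmap objects can be used in most places where strings are expected, including re searches):

import mmap

with open(InputFileName, 'rb') as InputFile:
    # Map the whole file (length 0) read-only; pages are loaded lazily,
    # so only the regions you actually touch get read from disk.
    data = mmap.mmap(InputFile.fileno(), 0, access=mmap.ACCESS_READ)
    magic_number_offset = data.find(magic_number)
    # ...the offset arithmetic above then works unchanged...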
If this solution is still slow, you can reverse the order of your data (reversed_data = data[::-1]) and search forward for the pattern [^\xFF].
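That helps because a cheap forward search for the first non-FF byte replaces the $-anchored scan. A sketch of the idea; the index arithmetic here is my own, so verify it on your data:

reversed_data = data[::-1]
blob_start = magic_number_offset - 72
rev_pos = len(data) - blob_start  # the blob start, counted from the end
match = re.compile("[^\xFF]").search(reversed_data, rev_pos)
offset = len(data) - match.start()  # one past the last header byte in the original data
offset -= 3  # start of the 3-byte header, as before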
Upvotes: 2