Alaa Waheed

Reputation: 23

Python 3.10 binary splitting script (inconsistent output)

I need to split a .bin file into chunks, but the data written to the new binary file is inconsistent. I can see the data, but there are shifts and gaps everywhere when I compare the split binary with the larger original.

def hash_file(filename: str, blocksize: int = 4096) -> str:
    blocksCount = 0
    with open(filename, "rb") as f:
        while True:

            # Read a new chunk from the binary file
            full_string = f.read(blocksize)
            if not full_string:
                break
            new_string = ' '.join('{:02x}'.format(b) for b in full_string)
            split_string = ''.join(chr(int(i, 16)) for i in new_string.split())

            # Append the split chunk to the new binary file
            newf = open("SplitBin.bin", "a", encoding="utf-8")
            newf.write(split_string)
            newf.close()

            # Check if the desired number of mem blocks has been reached
            blocksCount = blocksCount + 1
            if blocksCount == 1:
                break

This is a comparison between the split bin (right) and the original bin (left).

Upvotes: 1

Views: 361

Answers (1)

Mark Ransom

Reputation: 308216

For characters with ordinals between 0 and 0x7f, the UTF-8 representation is a single byte equal to the ordinal. But for characters with ordinals between 0x80 and 0xff, UTF-8 outputs two bytes: a lead byte of 0xc2 or 0xc3 followed by a continuation byte. Each such input byte therefore becomes two bytes in the output, and everything after it is shifted. That's why you're seeing inconsistencies.
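The size change is easy to confirm in an interpreter. This is only a minimal sketch; the variable names are made up for illustration, and it reproduces the chr() plus UTF-8 path from the question on every possible byte value:

```python
# Bytes 0x00-0x7f come through the chr() + UTF-8 path unchanged.
ascii_byte = chr(0x41).encode("utf-8")   # b'A', one byte

# Bytes 0x80-0xff each become two bytes.
low_high = chr(0x80).encode("utf-8")     # b'\xc2\x80'
top_high = chr(0xff).encode("utf-8")     # b'\xc3\xbf'

# Push all 256 byte values through the same conversion as the question.
data = bytes(range(256))
text = "".join(chr(b) for b in data)
encoded = text.encode("utf-8")

# 128 values stay one byte, 128 values grow to two: 128 + 256 = 384.
print(len(data), len(encoded))
```

The output file ends up longer than the input, which matches the shifts visible in a side-by-side hex comparison.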

The easiest way to fix it would be to open the output file in binary mode as well. Then you can eliminate all the formatting and splitting, because you can directly write the data you just read:

        with open("SplitBin.bin", "ab") as newf:
            newf.write(full_string)

Note that reopening the file each time you write to it will be very slow. Better to leave it open until you're done.
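Putting both suggestions together, one possible shape for the whole routine is sketched below. The function name copy_blocks and its parameters are hypothetical, not from the question, and the sketch assumes the output should be overwritten on each run ("wb") rather than appended to ("ab"):

```python
def copy_blocks(src_path: str, dst_path: str,
                blocksize: int = 4096, max_blocks: int = 1) -> int:
    """Copy up to max_blocks blocks of blocksize bytes; return bytes written."""
    written = 0
    # Both files are opened once, in binary mode, and closed by the with block.
    # Assumption: "wb" overwrites any existing output; use "ab" to append.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for _ in range(max_blocks):
            chunk = src.read(blocksize)
            if not chunk:  # end of input reached before max_blocks
                break
            dst.write(chunk)  # raw bytes, no text encoding involved
            written += len(chunk)
    return written
```

Calling copy_blocks("input.bin", "SplitBin.bin") would then copy the first 4096 bytes unchanged, where input.bin is a placeholder for the original file.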

Upvotes: 1
