User New

Reputation: 404

hex header of file, magic numbers, python

I have a program in Python which analyses file headers and decides which file type it is. (https://github.com/LeoGSA/Browser-Cache-Grabber)

The problem is the following. I read the first 24 bytes of a file:

with open (from_folder+"/"+i, "rb") as myfile:
    header=str(myfile.read(24))

Then I look for a pattern in it:

if y[1] in header:
    shutil.move (from_folder+"/"+i,to_folder+y[2]+i+y[3])

where y = ['/video', r'\x47\x40\x00', '/video/', '.ts']

y[1] is the pattern, i.e. r'\x47\x40\x00'.

The file does contain these bytes, as the picture below shows.

[image: hex view of the file]

The program does NOT find this pattern (r'\x47\x40\x00') in the file header.

So I tried printing header:

[image: console output of the printed header]

You see? Python shows it as 'G@' instead of '\x47\x40'.

And if I search for 'G@'+r'\x00' in header, everything is OK: it finds it.

Question: What am I doing wrong? I want to look for r'\x47\x40\x00' and find it, not for some strange 'G@'+r'\x00'.

OR

Why does Python show the first two bytes as 'G@' and not as '\x47\x40', though it shows the rest of the header in hex? Is there a way to fix it?

Upvotes: 3

Views: 7775

Answers (2)

Jacques de Hooge

Reputation: 6990

In Python 3 a binary read returns bytes rather than a string, so there is no need to convert it with str. print shows bytes in a human-readable form: bytes that are printable ASCII appear as characters (0x47 is 'G', 0x40 is '@'), and the rest appear as \x escapes. If you don't want that, convert your bytes to, e.g., hex representations of their integer values:

aBytes = b'\x00\x47\x40\x00\x13\x00\x00\xb0'
print (aBytes)
print (''.join ([hex (aByte) for aByte in aBytes]))

Output as redirected from the console:

b'\x00G@\x00\x13\x00\x00\xb0'
0x00x470x400x00x130x00x00xb0
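To see that only the display differs and not the data, here is a small sketch reusing aBytes from above. The variable names same, as_text and as_hex are introduced only for illustration.

```python
# The escaped spelling and the printable spelling denote the same two bytes.
aBytes = b'\x00\x47\x40\x00\x13\x00\x00\xb0'

same = (b'\x47\x40' == b'G@')   # 0x47 is 'G' and 0x40 is '@' in ASCII
as_text = repr(aBytes)          # the text print() shows for a bytes object
as_hex = aBytes.hex()           # fixed width: two hex digits per byte

print(same)      # True
print(as_text)   # b'\x00G@\x00\x13\x00\x00\xb0'
print(as_hex)    # 00474000130000b0
```

bytes.hex() avoids the variable-width output of hex(), which drops leading zeros ('0x0' rather than '0x00').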

You can't search in aBytes with the in operator using a str pattern such as r'\x47\x40\x00', since aBytes isn't a string but a bytes object. The pattern has to be bytes as well, e.g. b'\x47\x40\x00'.

If you want to apply a string search on '\x00\x47\x40', use:

aBytes = b'\x00\x47\x40\x00\x13\x00\x00\xb0'
print (aBytes)
print (r'\x'.join ([''] + ['%0.2x'%aByte for aByte in aBytes]))

Which will give you:

b'\x00G@\x00\x13\x00\x00\xb0'
\x00\x47\x40\x00\x13\x00\x00\xb0

So there are a number of separate issues at play here:

  • print tries to print something human readable, which succeeds only for the bytes that are printable ASCII (here 0x47 and 0x40, shown as G@).

  • You can't search a bytes object for a str pattern with in. Either write the pattern as bytes, or convert the header to a string containing fixed-length hex representations as substrings, as shown.
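For what it's worth, in Python 3 the in operator does accept a bytes pattern against a bytes header. It is mixing str and bytes that fails. A minimal sketch, assuming the pattern from the question is rewritten with a b prefix; SIGNATURES and match_signature are hypothetical names, not part of the asker's program:

```python
# Hypothetical signature table in the style of the question's y list,
# with the pattern written as bytes (b'...') instead of a raw str.
SIGNATURES = [
    ('/video', b'\x47\x40\x00', '/video/', '.ts'),
]

def match_signature(header, signatures=SIGNATURES):
    """Return the first entry whose pattern occurs in header, else None."""
    for entry in signatures:
        if entry[1] in header:   # bytes in bytes: a substring search
            return entry
    return None

# First eight bytes of the header shown in the asker's own answer.
header = b'\x47\x40\x00\x1b\x00\x00\xb0\x0d'
print(match_signature(header))
```

If the magic number must sit at the very start of the file, header.startswith(entry[1]) is a stricter test than in. Searching bytes with a str pattern raises TypeError.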

Upvotes: 1

User New

Reputation: 404

    import binascii

    with open(from_folder + "/" + i, "rb") as myfile:
        header = myfile.read(24)
        # hexlify returns bytes; decode to get a plain str of hex digits
        header = binascii.hexlify(header).decode("ascii")

The result I get is below, and I can work with it:

4740001b0000b00d0001c100000001efff3690e23dffffff
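One caveat with searching the hex string: a plain substring search can match across a byte boundary. The sketch below is only an illustration; hex_find_aligned is a hypothetical helper, and tricky is made-up data chosen to show the misaligned match.

```python
# Rebuild the 24 header bytes from the hex string shown above.
header = bytes.fromhex('4740001b0000b00d0001c100000001efff3690e23dffffff')
hex_header = header.hex()

def hex_find_aligned(hex_text, hex_pattern):
    """Return the byte offset of hex_pattern in hex_text, or -1.

    Only matches starting at an even index (a byte boundary) count.
    """
    pos = hex_text.find(hex_pattern)
    while pos != -1:
        if pos % 2 == 0:
            return pos // 2
        pos = hex_text.find(hex_pattern, pos + 1)
    return -1

print(hex_find_aligned(hex_header, '474000'))    # 0

tricky = bytes.fromhex('04740000')               # bytes 04 74 00 00
print('474000' in tricky.hex())                  # True, but misaligned
print(hex_find_aligned(tricky.hex(), '474000'))  # -1
```

Searching the raw bytes with a bytes pattern avoids this issue, since it always compares whole bytes.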

P.S. Anyway, if anybody can explain what the problem was with the first two bytes, I would be grateful.

Upvotes: 3
