Paul R

Reputation: 2797

Read hex from file (Python)

I have a file that contains both hex-escaped and plain text. For example, var _0x36ba=["\x69\x73\x41\x72\x72\x61\x79","\x63\x61\x6C\x6C","\x74\x6F\x53\x74\x72\x69\x6E\x67",]

When I paste this code directly into the Python console, I get var _0x36ba=["isArray","call","toString",]

But when I read the file and print its contents, I get var _0x36ba=["\\x69\\x73\\x41\\x72\\x72\\x61\\x79","\\x63\\x61\\x6C\\x6C","\\x74\\x6F\\x53\\x74\\x72\\x69\\x6E\\x67","\\

It seems the backslashes are being read as literal characters rather than as escape sequences.

How can I read the file and obtain readable output?

Upvotes: 0

Views: 2704

Answers (2)

Martijn Pieters

Reputation: 1121814

You have string literals with \xhh hex escapes. In Python 2, you can decode these with the string_escape encoding:

text.decode('string_escape')

See the Python Specific Encodings section of the codecs module documentation:

string_escape
Produce a string that is suitable as string literal in Python source code

Decoding reverses that encoding:

>>> "\\x69\\x73\\x41\\x72\\x72\\x61\\x79".decode('string_escape')
'isArray'
>>> "\\x63\\x61\\x6C\\x6C".decode('string_escape')
'call'
>>> "\\x74\\x6F\\x53\\x74\\x72\\x69\\x6E\\x67".decode('string_escape')
'toString'
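The string_escape codec and str.decode() exist only in Python 2. As a rough sketch of the same idea on Python 3, the text can be round-tripped through bytes and the unicode_escape codec. The helper name decode_hex_escapes and the temporary file are made up for illustration, and the sketch assumes the file holds only Latin-1 characters. Note that unicode_escape also interprets other escapes such as \n and \uXXXX, which may or may not be what you want.

```python
import os
import tempfile


def decode_hex_escapes(text):
    # Assumption: text contains only Latin-1 characters, so the
    # encode step is lossless; unicode_escape then resolves \xhh.
    return text.encode("latin-1").decode("unicode_escape")


# Hypothetical sample mirroring the question; written to a temp file
# so the example is self-contained.
sample = (r'var _0x36ba=["\x69\x73\x41\x72\x72\x61\x79",'
          r'"\x63\x61\x6C\x6C","\x74\x6F\x53\x74\x72\x69\x6E\x67",]')

with tempfile.NamedTemporaryFile("w", suffix=".js", delete=False,
                                 encoding="latin-1") as handle:
    handle.write(sample)
    path = handle.name

# Read the raw text back (backslashes are literal here) and decode it.
with open(path, encoding="latin-1") as handle:
    decoded = decode_hex_escapes(handle.read())
os.remove(path)

print(decoded)  # var _0x36ba=["isArray","call","toString",]
```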

Being a built-in codec, this is a lot faster than using regular expressions:

>>> from timeit import timeit
>>> import re
>>> def unescape(text):
...     return re.sub(r'\\x([0-9a-fA-F]{2})',
...         lambda m: chr(int(m.group(1), 16)), text)
...
>>> value = "\\x69\\x73\\x41\\x72\\x72\\x61\\x79"
>>> timeit('unescape(value)', 'from __main__ import unescape, value')
6.254786968231201
>>> timeit('value.decode("string_escape")', 'from __main__ import value')
0.43862390518188477

That's about 14 times faster.
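The timings above are from Python 2. If you want to repeat the comparison on Python 3, a sketch along these lines might do; it swaps in unicode_escape for string_escape (an assumption, since string_escape is not available there), and the function names are illustrative only. Absolute numbers and the ratio will vary by machine and interpreter version.

```python
import re
from timeit import timeit

HEX_ESCAPE = re.compile(r'\\x([0-9a-fA-F]{2})')


def unescape_regex(text):
    # Replace each \xhh sequence with the character it encodes.
    return HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def unescape_codec(text):
    # Assumption: Latin-1-only input; stand-in for Python 2's string_escape.
    return text.encode('latin-1').decode('unicode_escape')


value = r'\x69\x73\x41\x72\x72\x61\x79'

# Time 100,000 calls of each approach on the same input.
regex_seconds = timeit(lambda: unescape_regex(value), number=100000)
codec_seconds = timeit(lambda: unescape_codec(value), number=100000)
print('regex: %.3fs  codec: %.3fs' % (regex_seconds, codec_seconds))
```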

Upvotes: 2

Alfe

Reputation: 59426

EDIT: Please use Martijn's solution. I didn't know about text.decode('string_escape') yet, and of course it is much faster. My original answer follows.

Use this regular expression to unescape all escaped hexadecimal expressions within the string:

import re

def unescape(text):
    # Match an escaped backslash first, so that \\x41 is not read as \x41.
    return re.sub(r'\\\\|\\x([0-9a-fA-F]{2})',
        lambda m: chr(int(m.group(1), 16)) if m.group(1)
                  else '\\', text)

If you know that the input will not contain a double backslash followed by an x (e.g. foo bar \\x41 bloh, which should probably be interpreted as foo bar \x41 bloh rather than foo bar \A bloh), then you can simplify this to:

import re

def unescape(text):
    return re.sub(r'\\x([0-9a-fA-F]{2})',
        lambda m: chr(int(m.group(1), 16)), text)
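To make the difference between the two versions concrete, here is a small side-by-side sketch. The names unescape_strict and unescape_simple are only labels for this comparison, and the inputs are the example from the paragraph above plus one plain case.

```python
import re


def unescape_strict(text):
    # Consumes an escaped backslash (\\) as a unit before looking for \xhh.
    return re.sub(r'\\\\|\\x([0-9a-fA-F]{2})',
                  lambda m: chr(int(m.group(1), 16)) if m.group(1) else '\\',
                  text)


def unescape_simple(text):
    # Only looks for \xhh; does not know about escaped backslashes.
    return re.sub(r'\\x([0-9a-fA-F]{2})',
                  lambda m: chr(int(m.group(1), 16)), text)


plain = r'\x69\x73\x41\x72\x72\x61\x79'
tricky = r'foo bar \\x41 bloh'

print(unescape_strict(plain))   # isArray
print(unescape_simple(plain))   # isArray
print(unescape_strict(tricky))  # foo bar \x41 bloh
print(unescape_simple(tricky))  # foo bar \A bloh
```

Both versions agree on ordinary input; they differ only when an escaped backslash sits directly in front of an x and two hex digits.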

Upvotes: 1
