Reputation: 2797
I have a file that contains both hex-escaped data and non-hex data.
For example, var _0x36ba=["\x69\x73\x41\x72\x72\x61\x79","\x63\x61\x6C\x6C","\x74\x6F\x53\x74\x72\x69\x6E\x67",]
When I paste this code directly into the Python console, I get var _0x36ba=["isArray","call","toString",]
But when I try to read the file and print contents, it gives me var _0x36ba=["\\x69\\x73\\x41\\x72\\x72\\x61\\x79","\\x63\\x61\\x6C\\x6C","\\x74\\x6F\\x53\\x74\\x72\\x69\\x6E\\x67","\\
It seems the backslashes are read literally instead of being interpreted as escape sequences.
How can I read the file and obtain readable output?
Upvotes: 0
Views: 2704
Reputation: 1121814
You have string literals with \xhh hex escapes. You can decode these with the string_escape encoding (a Python 2 codec):
text.decode('string_escape')
See the Python Specific Encodings section of the codecs module documentation:
string_escape
Produce a string that is suitable as string literal in Python source code
Decoding reverses that encoding:
>>> "\\x69\\x73\\x41\\x72\\x72\\x61\\x79".decode('string_escape')
'isArray'
>>> "\\x63\\x61\\x6C\\x6C".decode('string_escape')
'call'
>>> "\\x74\\x6F\\x53\\x74\\x72\\x69\\x6E\\x67".decode('string_escape')
'toString'
Being a built-in codec, this is a lot faster than using regular expressions:
>>> from timeit import timeit
>>> import re
>>> def unescape(text):
...     return re.sub(r'\\x([0-9a-fA-F]{2})',
...                   lambda m: chr(int(m.group(1), 16)), text)
...
>>> value = "\\x69\\x73\\x41\\x72\\x72\\x61\\x79"
>>> timeit('unescape(value)', 'from __main__ import unescape, value')
6.254786968231201
>>> timeit('value.decode("string_escape")', 'from __main__ import value')
0.43862390518188477
That's about 14 times faster.
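On Python 3 the string_escape codec and str.decode() are no longer available. A rough equivalent, sketched here on the assumption that the file is read in binary mode and holds only ASCII (or Latin-1) text, is the unicode_escape codec. The helper name decode_hex_escapes is made up for this illustration.

```python
import codecs


def decode_hex_escapes(raw: bytes) -> str:
    """Decode backslash escapes such as \\x69 found in raw bytes.

    Assumption: the bytes are ASCII/Latin-1. unicode_escape treats any
    non-escape byte as Latin-1, so UTF-8 input would come out garbled.
    """
    return codecs.decode(raw, "unicode_escape")


# The rb'' prefix keeps the backslashes literal, just as they are in the file.
sample = rb'var _0x36ba=["\x69\x73\x41\x72\x72\x61\x79","\x63\x61\x6C\x6C"]'
print(decode_hex_escapes(sample))
```

For a real file, the bytes would come from something like open(path, 'rb').read(). Note that unicode_escape also interprets other escapes (\n, \uXXXX, and so on), which may or may not be what you want for JavaScript source.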
Upvotes: 2
Reputation: 59426
EDIT: Please use Martijn's solution. I didn't know about text.decode('string_escape') yet, and of course it is much faster. My original answer follows.
Use this regular expression to unescape all escaped hexadecimal expressions within the string:
import re

def unescape(text):
    # '\\\\' matches an escaped backslash; the group matches the two hex digits of \xhh.
    return re.sub(r'\\\\|\\x([0-9a-fA-F]{2})',
                  lambda m: chr(int(m.group(1), 16)) if m.group(1)
                  else '\\', text)
If you know that the input will not contain a double backslash followed by an x (e.g. foo bar \\x41 bloh, which should probably be interpreted as foo bar \x41 bloh rather than foo bar \A bloh), then you can simplify this to:
def unescape(text):
    return re.sub(r'\\x([0-9a-fA-F]{2})',
                  lambda m: chr(int(m.group(1), 16)), text)
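To make the difference between the two variants concrete, here is a small side-by-side sketch. The names unescape_strict and unescape_simple are only labels for this comparison; the regular expressions are the ones from the answer above.

```python
import re


def unescape_strict(text):
    # Consumes an escaped backslash (\\) first, so a following x41 is left alone.
    return re.sub(r'\\\\|\\x([0-9a-fA-F]{2})',
                  lambda m: chr(int(m.group(1), 16)) if m.group(1) else '\\',
                  text)


def unescape_simple(text):
    # Only looks for \xhh, so the second backslash of \\x41 starts a match.
    return re.sub(r'\\x([0-9a-fA-F]{2})',
                  lambda m: chr(int(m.group(1), 16)), text)


sample = r'foo bar \\x41 bloh'   # two literal backslashes, then x41
print(unescape_strict(sample))   # foo bar \x41 bloh
print(unescape_simple(sample))   # foo bar \A bloh
```

On input without doubled backslashes both functions give the same result, so the simpler version is enough for data like the array in the question.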
Upvotes: 1