Reputation: 1349
I am working with multilingual strings, and many of them contain Hebrew, Arabic, Chinese etc. characters that are encoded and appear in the format \\x00, i.e. a double backslash followed by x and two hex digits. The strings are bytes objects, i.e. they appear in the form b''.
I have read several answers here on SO and on other sites, but I still can't work out how to convert these back to the original characters.
I know that if the backslashes were single, the following would solve it:
b'\xd7\x90\xd7\x91\xd7\x92'.decode('utf-8')
and it would return 'אבג'.
But when I apply the same .decode('utf-8') method to my string, the output is garbled:
b'\\xd7\\x90\\xd7\\x91\\xd7\\x92'.decode('utf-8')
returns '×\x90×\x91×\x92'.
I would really rather not use a regex for this; there must be a nicer solution that I am not aware of!
Upvotes: 1
Views: 383
Reputation: 531708
It's not pretty, but assuming your string has no quotation marks in it, you might try
>>> import ast
>>> x = b'\\xd7\\x90\\xd7\\x91\\xd7\\x92'
>>> ast.literal_eval(ast.literal_eval(f'"{x}"')).decode()
'אבג'
This is based on the assumption that the original value was indeed a normally encoded str object:
>>> 'אבג'.encode()
b'\xd7\x90\xd7\x91\xd7\x92'
but you got its representation instead of the actual string.
>>> repr('אבג'.encode())
"b'\\xd7\\x90\\xd7\\x91\\xd7\\x92'"
Wrapping your value in quotes creates a string that literal_eval can restore to the representation shown above, which can be further evaluated to a "real" byte string that can be decoded.
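To make the two evaluation passes easier to follow, here is a minimal sketch that spells out each intermediate value. The function names are made up for illustration. The second function is an alternative that is not part of the approach above: it assumes the input contains only ASCII text plus \xNN escapes, and uses the standard unicode_escape codec followed by a latin-1 round trip to get the raw bytes back.

```python
import ast


def unescape_via_literal_eval(raw: bytes) -> str:
    """Hypothetical helper: the nested literal_eval approach, step by step."""
    # repr(raw) is the text b'\\xd7\\x90...' with doubled backslashes;
    # wrapping it in double quotes makes it look like a str literal.
    quoted = f'"{raw!r}"'
    # First pass: parse as a str literal, collapsing each \\ into a single \.
    bytes_literal_text = ast.literal_eval(quoted)
    # Second pass: parse that text as a bytes literal, so \xd7 becomes a real byte.
    real_bytes = ast.literal_eval(bytes_literal_text)
    return real_bytes.decode("utf-8")


def unescape_via_codec(raw: bytes) -> str:
    """Hypothetical helper: alternative using the unicode_escape codec."""
    # unicode_escape interprets the \xNN escapes and yields code points 0-255;
    # latin-1 maps those code points back to identical byte values.
    return raw.decode("unicode_escape").encode("latin-1").decode("utf-8")


x = b'\\xd7\\x90\\xd7\\x91\\xd7\\x92'
print(unescape_via_literal_eval(x))  # אבג
print(unescape_via_codec(x))         # אבג
```

The literal_eval version still breaks if the data contains quotation marks, as noted above; the codec version does not have that restriction, but it will misread any bytes that are already real non-ASCII UTF-8 rather than escapes.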
Upvotes: 1