How to convert double-backslash unicode chars to originals in Python?

Question

I am working with multilingual strings, and many of them contain Hebrew, Arabic, Chinese etc. characters that are encoded and appear in this format: \\x00, i.e. a two-digit hex escape preceded by a double backslash. The strings are in bytes format, i.e. they appear in this format: b''.

I have read several comments here on SO and other sites but still can't get my head around how to convert these back to the original characters.

I know that if the backslashes were single ones, the following would solve it:

b'\xd7\x90\xd7\x91\xd7\x92'.decode('utf-8')

and it would return: 'אבג'.

But when applying the same .decode('utf-8') method on my string, the outputs are messed up.

b'\\xd7\\x90\\xd7\\x91\\xd7\\x92'.decode('utf-8')

will return: '×\x90×\x91×\x92'.

I really wouldn't want to apply regex on it, there must be a nicer solution that I am not aware of!

chepner · Accepted Answer

It's not pretty, but assuming your string has no quotation marks in it, you might try

>>> import ast
>>> x = b'\\xd7\\x90\\xd7\\x91\\xd7\\x92'
>>> ast.literal_eval(ast.literal_eval(f'"{x}"')).decode()
'אבג'

This is based on the assumption that the original value was indeed a normally encoded str object:

>>> 'אבג'.encode()
b'\xd7\x90\xd7\x91\xd7\x92'

but you got its representation instead of the actual string.

>>> repr('אבג'.encode())
"b'\\xd7\\x90\\xd7\\x91\\xd7\\x92'"

Wrapping your value in quotes creates a string that literal_eval can restore to the representation shown above, which can be further evaluated to a "real" byte string that can be decoded.
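To make the two evaluation passes easier to follow, here is a minimal sketch that splits them into named steps. The function names are hypothetical, and the second function is not part of the approach above: it is a different, commonly used codec round trip, shown only for comparison. Both assume the input contains only \\xNN-style escapes of UTF-8 bytes and no quotation marks.

```python
import ast


def unescape_via_literal_eval(raw: bytes) -> str:
    """Hypothetical helper: the double literal_eval approach, step by step."""
    # repr(raw) is text such as b'\\xd7\\x90' (backslashes doubled).
    # Using !r makes the repr explicit instead of relying on str(bytes).
    quoted = f'"{raw!r}"'
    # First pass: parse as a str literal; the doubled backslashes collapse,
    # leaving the text b'\xd7\x90'.
    bytes_literal = ast.literal_eval(quoted)
    # Second pass: that text is itself a bytes literal, so this yields real bytes.
    real_bytes = ast.literal_eval(bytes_literal)
    return real_bytes.decode("utf-8")


def unescape_via_codec(raw: bytes) -> str:
    """Hypothetical helper: an alternative using the unicode_escape codec."""
    # unicode_escape turns each \xNN escape into the code point U+00NN;
    # latin-1 maps those code points back to single bytes, which are
    # then decoded as UTF-8.
    return raw.decode("unicode_escape").encode("latin-1").decode("utf-8")


escaped = b"\\xd7\\x90\\xd7\\x91\\xd7\\x92"
print(unescape_via_literal_eval(escaped))
print(unescape_via_codec(escaped))
```

The codec version does not depend on quoting, so it may be more forgiving if the data contains quotation marks, but it will raise an error at the latin-1 step if the input contains escapes such as \\u05d0 that produce code points above 255.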
