lazarea

Reputation: 1349

How to convert double-backslash unicode chars to originals in Python?

I am working with multilingual strings, many of which contain Hebrew, Arabic, Chinese, etc. characters that are encoded and appear in the form \\x00, i.e. a double backslash followed by x and two hex digits. The strings are bytes objects, i.e. they look like b''.

I have read several comments here on SO and on other sites, but I still can't work out how to convert these back to the original characters.

I know that if the backslashes were single ones, the following would solve it:

b'\xd7\x90\xd7\x91\xd7\x92'.decode('utf-8')

and it would return: 'אבג'.

But when I apply the same .decode('utf-8') method to my string, the output is garbled.

b'\\xd7\\x90\\xd7\\x91\\xd7\\x92'.decode('utf-8')

will return: '×\x90×\x91×\x92'.

I would really rather not use a regex for this; there must be a nicer solution that I am not aware of!

Upvotes: 1

Views: 383

Answers (1)

chepner

Reputation: 531708

It's not pretty, but assuming your string has no quotation marks in it, you might try

>>> import ast
>>> x = b'\\xd7\\x90\\xd7\\x91\\xd7\\x92'
>>> ast.literal_eval(ast.literal_eval(f'"{x}"')).decode()
'אבג'

This is based on the assumption that the original value was indeed a normally encoded str object:

>>> 'אבג'.encode()
b'\xd7\x90\xd7\x91\xd7\x92'

but you got its representation instead of the actual byte string.

>>> repr('אבג'.encode())
"b'\\xd7\\x90\\xd7\\x91\\xd7\\x92'"

Wrapping your value in quotes creates a string that literal_eval can restore to the representation shown above, which can be further evaluated to a "real" byte string that can be decoded.
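
If the quotation-mark caveat is a concern, a different route (not the literal_eval approach above) is to let the unicode_escape codec interpret the escapes and then round-trip through Latin-1 to get the raw bytes back. The following is only a minimal sketch: the helper name unescape_utf8 is made up for illustration, and it assumes the data contains only \xNN-style escapes and that the original text was UTF-8.

```python
def unescape_utf8(raw: bytes) -> str:
    """Turn bytes holding literal backslash-x escapes into the text they encode.

    Hypothetical helper; assumes \\xNN escapes only and UTF-8 as the original encoding.
    """
    # Step 1: interpret the backslash escapes; each \xNN becomes the code point U+00NN.
    as_code_points = raw.decode("unicode_escape")
    # Step 2: map those code points back to single bytes (Latin-1 is 1:1 for 0-255).
    real_bytes = as_code_points.encode("latin-1")
    # Step 3: decode the recovered bytes as UTF-8.
    return real_bytes.decode("utf-8")


x = b'\\xd7\\x90\\xd7\\x91\\xd7\\x92'
result = unescape_utf8(x)
print(result)
```

Note that step 2 would raise UnicodeEncodeError if the input also contained \uXXXX escapes for code points above 255, so this is only suitable when the escapes really are byte-level ones like those in the question.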

Upvotes: 1
