Reputation: 37
I have a bunch of strings containing UTF-8 encoded symbols, for example '\\u00f0\\u009f\\u0098\\u0086'. In this case it represents the emoji 😆, encoded in UTF-8. I want to replace it with the literal emoji. The solution someone recommended to me was to encode it as latin-1 and then decode it as utf-8. So,
'\u00f0\u009f\u0098\u0086'.encode('latin-1').decode('utf-8')
gives me the output
'😆'
Unfortunately, all the strings with those codes have literal backslashes in them, so whenever I try the same operations,
'\\u00f0\\u009f\\u0098\\u0086'.encode('latin-1').decode('utf-8')
I get the following result,
'\\u00f0\\u009f\\u0098\\u0086'
Is there a way to remove those backslashes? Because if I replace them with an empty string, all backslashes disappear.
Upvotes: 0
Views: 1262
Reputation: 2776
A bytes object such as b'\\u00f0\\u009f\\u0098\\u0086' can be decoded directly with the "unicode_escape" codec (if you have a str, call .encode() on it first).
For example:
>>> b'\\u00f0\\u009f\\u0098\\u0086'.decode("unicode_escape")
'ð\x9f\x98\x86'
Although the result looks different, it is the same string as the literal written with single backslashes:
>>> b'\\u00f0\\u009f\\u0098\\u0086'.decode("unicode_escape") == '\u00f0\u009f\u0098\u0086'
True
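To get from there to the actual emoji, the latin-1/utf-8 round trip from the question still has to be applied. A minimal sketch of the whole chain, assuming the input is a str containing literal backslashes (the variable names are just for illustration):

```python
# The str as it appears in the question: literal backslashes followed by uXXXX.
escaped = '\\u00f0\\u009f\\u0098\\u0086'

# 1. str -> bytes, so that the unicode_escape codec can be applied.
# 2. unicode_escape turns each \u00XX into the code point U+00XX.
# 3. latin-1 maps each code point U+00XX back to the single byte 0xXX.
# 4. Those bytes are the real UTF-8 encoding of the emoji.
unescaped = escaped.encode('ascii').decode('unicode_escape')
emoji = unescaped.encode('latin-1').decode('utf-8')

print(emoji)
```

Note that .encode('ascii') raises UnicodeEncodeError if the string also contains non-ASCII characters, so this only fits input that is pure ASCII apart from the escapes.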
Beware that this also processes every other backslash escape in the data, not just \uXXXX: for instance, \" becomes a bare ". For example, the following JSON breaks after decoding:
>>> encoded_json = b'{"a":"Basic realm=\\"Dost\\udceapz"}'
>>> encoded_json.decode("unicode_escape")
'{"a":"Basic realm="Dost\udceapz"}'
>>> json.loads(encoded_json)
{'a': 'Basic realm="Dost\udceapz'}
>>> json.loads(encoded_json.decode("unicode_escape"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/javier/.pyenv/versions/3.6.15/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/home/javier/.pyenv/versions/3.6.15/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/javier/.pyenv/versions/3.6.15/lib/python3.6/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 20 (char 19)
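If the escaped strings do come from JSON, one way around this is to let the JSON parser resolve the escapes and only then repair the text. A rough sketch, assuming a made-up payload (the key "msg" and its value are hypothetical, not from the question):

```python
import json

# Hypothetical JSON document whose string value holds the \u00XX escapes.
payload = b'{"msg": "\\u00f0\\u009f\\u0098\\u0086"}'

# json.loads handles \uXXXX (and \" etc.) according to the JSON rules,
# so the document structure stays intact.
data = json.loads(payload)

# The value is now the mis-decoded text; undo it with the latin-1/utf-8 round trip.
fixed = data['msg'].encode('latin-1').decode('utf-8')

print(fixed)
```

The round trip raises UnicodeEncodeError for values containing characters above U+00FF, so it should only be applied to fields known to be affected.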
Upvotes: 1
Reputation: 52529
I don't know where you're getting that string from, but it's an unusual way of representing the codepoint. U+1F606 SMILING FACE WITH OPEN MOUTH AND TIGHTLY-CLOSED EYES is encoded in UTF-8 as the bytes F0 9F 98 86. In Python string escapes, \uXXXX represents an entire codepoint in the Basic Multilingual Plane, and \UXXXXXXXX represents codepoints beyond it (like this one); neither stands for a single byte of a UTF-8 encoding. So you'd expect to see it represented in a string as '\U0001F606'.
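A quick check of that relationship, as a small sketch (variable names are only for illustration):

```python
# The emoji written with the escape you would normally expect.
emoji = '\U0001F606'

# It is a single code point...
code_point = hex(ord(emoji))

# ...whose UTF-8 encoding is the four bytes seen in the odd escaped string.
utf8_hex = emoji.encode('utf-8').hex()

print(code_point, utf8_hex)
```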
Anyways, the following will extract the last two hex digits of each escape sequence, turn them into a byte array, and then decode the resulting UTF-8 data into a string:
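import re

s = '\\u00f0\\u009f\\u0098\\u0086'  # named s to avoid shadowing the built-in str
# Keep only the low byte (last two hex digits) of each \uXXXX escape.
raw = b''.join(bytes.fromhex(m.group(1)) for m in re.finditer(r'\\u[0-9a-fA-F]{2}([0-9a-fA-F]{2})', s))
print(raw.decode())
# Displays 😆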
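The same idea can be wrapped in a function that also works when the escapes are embedded in other text. This is only a sketch under a few assumptions: the helper name fix_escaped_utf8 is made up, only escapes of the form \u00XX are treated as bytes, and runs that are not valid UTF-8 are left untouched.

```python
import re

# One or more consecutive \u00XX escapes (with a literal backslash in the text).
_ESCAPED_BYTES = re.compile(r'(?:\\u00[0-9a-fA-F]{2})+')

def fix_escaped_utf8(text):
    """Replace runs of literal \\u00XX escapes with the UTF-8 text they encode."""
    def repl(match):
        # Collect the low byte of every escape in this run.
        hex_digits = re.findall(r'\\u00([0-9a-fA-F]{2})', match.group(0))
        raw = bytes.fromhex(''.join(hex_digits))
        try:
            return raw.decode('utf-8')
        except UnicodeDecodeError:
            # Not valid UTF-8: keep the original escapes rather than guess.
            return match.group(0)
    return _ESCAPED_BYTES.sub(repl, text)

print(fix_escaped_utf8('lol \\u00f0\\u009f\\u0098\\u0086 ok'))
```

Because only \u00XX runs are rewritten, other backslash sequences such as \\n or \" pass through unchanged, which avoids the JSON problem seen with unicode_escape.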
Upvotes: 1