Reputation: 111325
Given the following string:
str = "\\u20ac €"
How to decode it into € €
?
Using str.encode("utf-8").decode("unicode-escape")
returns € â\x82¬
(To clarify, I am looking for a general solution how to decode any mix of unicode and escaped characters)
Upvotes: 2
Views: 750
Reputation: 177901
A simple and fast solution is to use re.sub
to match \u
and exactly four hexadecimal digits, and convert those digits into a Unicode code point:
import re
s = r"blah bl\uah \u20ac € b\u20aclah\u12blah blah"
print(s)
s = re.sub(r'\\u([0-9a-fA-F]{4})',lambda m: chr(int(m.group(1),16)),s)
print(s)
Output:
blah bl\uah \u20ac € b\u20aclah\u12blah blah
blah bl\uah € € b€lah\u12blah blah
Upvotes: 2
Reputation: 81654
If this is always going to be the format of the string, use .split
:
string = "\\u20ac €"
escaped_unicode, non_escaped_unicode = string.split()
output = '{} {}'.format(escaped_unicode.encode("utf-8").decode("unicode-escape"), non_escaped_unicode)
print(output)
# € €
If not, we'll need to get more creative. I think the most generic solution will be to still use split
, but then use regex to determine if we need to handle an escaped unicode (assuming the input is sane enough to not mix unicode and escaped unicode in the same "word")
import re
string = "ac ab \\u20ac cdef €"
regex = re.compile(r'([\u0000-\u007F]+)')
output = []
for word in string.split():
match = regex.search(word)
if match:
try:
output.append(match[0].encode("utf-8").decode("unicode-escape"))
except UnicodeDecodeError:
# assuming the string contained a literal \\u or anything else
# that decode("unicode-escape") could not handle, so adding to output as is
output.append(word)
else:
output.append(word)
print(' '.join(output))
# ac ab € cdef €
Upvotes: 1