serg

Reputation: 111325

How to decode a partially escaped Unicode string in Python (mixed Unicode and escaped Unicode)?

Given the following string:

str = "\\u20ac €"

How can I decode it into € €?

Using str.encode("utf-8").decode("unicode-escape") returns € â\x82¬

(To clarify, I am looking for a general solution for decoding any mix of Unicode and escaped characters.)

Upvotes: 2

Views: 750

Answers (2)

Mark Tolonen

Reputation: 177901

A simple and fast solution is to use re.sub to match \u and exactly four hexadecimal digits, and convert those digits into a Unicode code point:

import re

s = r"blah bl\uah \u20ac € b\u20aclah\u12blah blah"
print(s)

s = re.sub(r'\\u([0-9a-fA-F]{4})', lambda m: chr(int(m.group(1), 16)), s)
print(s)

Output:

blah bl\uah \u20ac € b\u20aclah\u12blah blah
blah bl\uah € € b€lah\u12blah blah
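
The same idea can be extended to the eight-digit \UXXXXXXXX form. The following is only a sketch under that assumption: the helper name decode_escapes is made up for illustration, and it does not recombine surrogate pairs written as two consecutive \u escapes.

```python
import re

# Hypothetical helper: matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})|\\U([0-9a-fA-F]{8})')

def decode_escapes(text):
    def replace(m):
        code_point = int(m.group(1) or m.group(2), 16)
        # Values above U+10FFFF are not valid code points; leave them untouched.
        if code_point > 0x10FFFF:
            return m.group(0)
        return chr(code_point)
    return _ESCAPE.sub(replace, text)

print(decode_escapes(r"\u20ac € \U0001F600 \u12"))
# € € 😀 \u12
```

Malformed sequences such as \u12 are left exactly as they were, as in the original regex.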

Upvotes: 2

DeepSpace

Reputation: 81654

If this is always going to be the format of the string, use .split:

string = "\\u20ac €"
escaped_unicode, non_escaped_unicode = string.split()
output = '{} {}'.format(escaped_unicode.encode("utf-8").decode("unicode-escape"), non_escaped_unicode)
print(output)
# € €

If not, we'll need to get more creative. I think the most generic solution is still to use split, and then use a regex to decide whether a word needs to be handled as escaped Unicode (assuming the input is sane enough not to mix Unicode and escaped Unicode in the same "word"):

import re

string = "ac ab \\u20ac cdef €"
regex = re.compile(r'([\u0000-\u007F]+)')
output = []
for word in string.split():
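    # fullmatch: only decode words made up entirely of ASCII characters,
    # so non-ASCII characters in a mixed word are never dropped
    match = regex.fullmatch(word)
    if match:
        try:
            output.append(match[0].encode("utf-8").decode("unicode-escape"))
        except UnicodeDecodeError:
            # the word contained a malformed escape (e.g. a literal \u without
            # four hex digits) that decode("unicode-escape") could not handle,
            # so add it to the output as is
            output.append(word)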
    else:
        output.append(word)
print(' '.join(output))
# ac ab € cdef €
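
If splitting on whitespace is too restrictive, one alternative sketch (not part of the approach above) is to encode to Latin-1 with the backslashreplace error handler, so that every character outside Latin-1 becomes an escape, and then let unicode-escape decode the whole string. The helper name decode_mixed is made up for illustration. Note that this also interprets other backslash sequences such as \n, and raises UnicodeDecodeError on malformed escapes such as \uah.

```python
def decode_mixed(text):
    # Characters outside Latin-1 (such as €) are rewritten as \uXXXX or
    # \UXXXXXXXX escapes; Latin-1 characters become single bytes.
    raw = text.encode("latin-1", errors="backslashreplace")
    # unicode-escape decodes the escapes and reads all other bytes as Latin-1.
    return raw.decode("unicode-escape")

print(decode_mixed("\\u20ac €"))
# € €
```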

Upvotes: 1
