Python 2.7: Back to utf-8 after using decode unicode-escape

Question

I am trying to remove certain characters, but my inputs come in different forms (for example u'ä' or \u0001), so I used encode("utf-8").decode("unicode-escape") to bring them into the same format before cleaning:

s = "\u0001"
s.encode("utf-8")
'\\u0001'
s.encode("utf-8").decode("unicode-escape")
u'\x01'

or

s = u'ä'
s.encode("utf-8")
'\xc3\xa4'
s.encode("utf-8").decode("unicode-escape")
u'\xc3\xa4'

The question is: how do I get back to UTF-8 afterwards? I found .encode("raw_unicode_escape"), which passes basic tests, but I'm still not sure it is correct.

PM 2Ring · Accepted Answer

I don't understand why (or how) you have a mixture of byte strings and Unicode strings like that. But if that's what your data is like, then you need to process the two types of strings differently.
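As a rough illustration of why the two types cannot share one code path, here is a small sketch (the variable names are made up for the example). The unicode-escape codec reads every non-ASCII byte as the Latin-1 code point with the same value, so applying it to UTF-8 bytes splits a multi-byte character into several characters. The literals use explicit b'' and u'' prefixes so that the snippet should also run unchanged on Python 3.

```python
# Illustrative sketch; the names below are hypothetical.

# a-umlaut (U+00E4) encoded as UTF-8 is the two bytes 0xC3 0xA4
utf8_bytes = u'\xe4'.encode('utf-8')

# unicode-escape treats each non-ASCII byte as a Latin-1 code point,
# so the single character comes back as two characters.
garbled = utf8_bytes.decode('unicode-escape')
print(len(garbled))

# Decoding the same bytes as UTF-8 restores the single character.
correct = utf8_bytes.decode('utf-8')
print(len(correct))

# A byte string that holds a literal backslash escape is the case
# unicode-escape is meant for: six bytes become one character.
escaped = b'\\u0001'
print(len(escaped))
print(len(escaped.decode('unicode-escape')))
```

In other words, unicode-escape is only appropriate for byte strings that contain backslash escapes; text that is already Unicode (or already UTF-8) should not be pushed through it.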

The code below first prints the representation and type of each string in data.
It then calls decode('unicode-escape') on the plain byte strings, which converts them to Unicode strings.
Finally, all strings are encoded from Unicode to UTF-8 byte strings.

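# -*- coding: utf-8 -*-
data = [
    'byte string',
    u'unicode string',
    'this byte string has unicode escapes: \u2122\u00e6',
    u'this unicode string has non-ascii chars: ©æ™ä',
]

for s in data:
    # Show each string's representation and its type
    print repr(s), type(s)
    # Only plain byte strings need their escape sequences decoded
    if isinstance(s, str):
        s = s.decode('unicode-escape')
    # s is now a Unicode string; encode it to UTF-8 bytes
    z = s.encode('utf8')
    print repr(z), z
    print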

output

'byte string' <type 'str'>
'byte string' byte string

u'unicode string' <type 'unicode'>
'unicode string' unicode string

'this byte string has unicode escapes: \\u2122\\u00e6' <type 'str'>
'this byte string has unicode escapes: \xe2\x84\xa2\xc3\xa6' this byte string has unicode escapes: ™æ

u'this unicode string has non-ascii chars: \xa9\xe6\u2122\xe4' <type 'unicode'>
'this unicode string has non-ascii chars: \xc2\xa9\xc3\xa6\xe2\x84\xa2\xc3\xa4' this unicode string has non-ascii chars: ©æ™ä

The above output was produced in a terminal that's configured to use UTF-8.
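The same logic can be wrapped in a small helper. This is only a sketch under the assumptions used above: byte strings hold ASCII text plus optional backslash escapes, and Unicode strings are already decoded. The name to_utf8 is invented for illustration, and the isinstance check uses bytes rather than str (they are the same type in Python 2.7) so that the snippet should also run on Python 3.

```python
def to_utf8(s):
    """Return s as a UTF-8 encoded byte string.

    Hypothetical helper: byte strings are assumed to contain ASCII text
    with optional backslash escapes; Unicode strings are assumed to be
    already decoded.
    """
    if isinstance(s, bytes):  # bytes is an alias of str in Python 2.7
        s = s.decode('unicode-escape')
    return s.encode('utf-8')


samples = [
    b'byte string',
    u'unicode string',
    b'escapes: \\u2122\\u00e6',
    u'non-ascii: \xa9\xe6\u2122\xe4',
]

for item in samples:
    # repr keeps the output ASCII-only regardless of terminal encoding
    print(repr(to_utf8(item)))
```

One caveat: if a byte string may already contain raw UTF-8 bytes, unicode-escape reads each of those bytes as a separate Latin-1 character, so this helper would double-encode them. Input like that would need its own branch.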
