Valentina

Reputation: 534

Python 2.7: Back to utf-8 after using decode unicode-escape

I am trying to remove certain characters, but my inputs come in different forms (for example u'ä' or the escaped text \\u0001), so I used encode("utf-8").decode("unicode-escape") to bring them to the same format before cleaning:

>>> s = "\\u0001"
>>> s.encode("utf-8")
'\\u0001'
>>> s.encode("utf-8").decode("unicode-escape")
u'\x01'

or

>>> s = u'ä'
>>> s.encode("utf-8")
'\xc3\xa4'
>>> s.encode("utf-8").decode("unicode-escape")
u'\xc3\xa4'

The question is: how do I get back to UTF-8 afterwards? I found .encode("raw_unicode_escape"), which passes my basic tests, but I am still not sure it is correct.

Upvotes: 0

Views: 6952

Answers (2)

PM 2Ring

Reputation: 55479

I don't understand why (or how) you have a mixture of byte strings and Unicode strings like that. But if that's what your data is like, then you need to process the two types of strings differently.

The code below first prints the representation of each string in data, and the type of object that the string is.
It then calls the decode('unicode-escape') method on the plain byte strings, which will convert them to Unicode strings.
Then all strings are encoded from Unicode to UTF-8 byte strings.

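# -*- coding: utf-8 -*-
# The declaration above is needed in a Python 2 source file
# because of the non-ASCII literals below.
data = [
    'byte string',
    u'unicode string',
    'this byte string has unicode escapes: \\u2122\\u00e6',
    u'this unicode string has non-ascii chars: ©æ™ä',
]

for s in data:
    # Show the representation and the type of each string
    print repr(s), type(s)
    if isinstance(s, str):
        # Byte string: interpret backslash escapes, giving a Unicode string
        s = s.decode('unicode-escape')
    # s is Unicode at this point; encode it as a UTF-8 byte string
    z = s.encode('utf8')
    print repr(z), z
    print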

output

'byte string' <type 'str'>
'byte string' byte string

u'unicode string' <type 'unicode'>
'unicode string' unicode string

'this byte string has unicode escapes: \\u2122\\u00e6' <type 'str'>
'this byte string has unicode escapes: \xe2\x84\xa2\xc3\xa6' this byte string has unicode escapes: ™æ

u'this unicode string has non-ascii chars: \xa9\xe6\u2122\xe4' <type 'unicode'>
'this unicode string has non-ascii chars: \xc2\xa9\xc3\xa6\xe2\x84\xa2\xc3\xa4' this unicode string has non-ascii chars: ©æ™ä

The above output was produced in a terminal that's configured to use UTF-8.
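
If the same conversion is needed in several places, the loop body above could be wrapped in a small function. This is only a sketch: to_utf8 is a hypothetical name, not part of the standard library, and it tests against bytes (an alias of str in Python 2.7) so that the same code should also run on Python 3. It assumes that byte strings contain only ASCII text plus backslash escapes.

```python
def to_utf8(s):
    """Hypothetical helper: return a UTF-8 byte string for either input type."""
    if isinstance(s, bytes):
        # Byte string: interpret escapes such as \u2122, giving Unicode text.
        # Assumption: the bytes are plain ASCII apart from the escapes.
        s = s.decode('unicode-escape')
    # s is Unicode text here; encode it as UTF-8 bytes.
    return s.encode('utf-8')


escaped = to_utf8(b'escapes: \\u2122\\u00e6')
literal = to_utf8(u'\xe4')
```

Be aware that unicode-escape treats every non-ASCII byte as Latin-1, so a byte string that already holds UTF-8 data (such as '\xc3\xa4') would come out double-encoded.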

Upvotes: 3

mhawke

Reputation: 87084

Decode the byte string with unicode-escape, then encode the result as UTF-8, like this:

>>> s = "\\u0001"
>>> s.decode('unicode-escape')
u'\x01'
>>> s.decode('unicode-escape').encode('utf8')
'\x01'

Here is an example for which it is a bit more obvious that the result is UTF-8 encoded:

>>> s = "\\u3030"
>>> s.decode('unicode-escape')
u'\u3030'
>>> s.decode('unicode-escape').encode('utf8')
'\xe3\x80\xb0'
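
As for the .encode("raw_unicode_escape") mentioned in the question, a quick check suggests it is not a substitute for encoding to UTF-8. Code points below U+0100 are written out as single bytes, which is probably why the basic tests pass, but anything higher is turned back into a \uXXXX escape. A small sketch, written with b'' and u'' literals so that it should behave the same on Python 2.7 and Python 3 (the variable names are just for illustration):

```python
escaped = b'\\u3030'                          # byte string holding a \u escape
text = escaped.decode('unicode-escape')       # Unicode text: u'\u3030'

raw = text.encode('raw_unicode_escape')       # back to the escape text b'\\u3030'
utf8 = text.encode('utf-8')                   # UTF-8 bytes b'\xe3\x80\xb0'

# Code points below U+0100 are emitted as single bytes, so the
# mis-decoded u'\xc3\xa4' from the question happens to round-trip.
low = u'\xc3\xa4'.encode('raw_unicode_escape')

print(repr(raw))
print(repr(utf8))
print(repr(low))
```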

Upvotes: 1
