Reputation: 534
I am trying to remove certain characters, but the inputs come in different forms (they can be u'ä' or "\\u0001" and so on), so I used encode("utf-8").decode("unicode-escape") to bring them to the same format before cleaning:
>>> s = "\\u0001"
>>> s.encode("utf-8")
'\\u0001'
>>> s.encode("utf-8").decode("unicode-escape")
u'\x01'
or
>>> s = u'ä'
>>> s.encode("utf-8")
'\xc3\xa4'
>>> s.encode("utf-8").decode("unicode-escape")
u'\xc3\xa4'
The question is: how do I get back to UTF-8 afterwards?
I found .encode("raw_unicode_escape"), which passes basic tests, but I'm still not sure it is correct.
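One way to probe that doubt is to run the same chain on a few inputs and look at the bytes that come back. The sketch below assumes Python 3, where Python 2's `unicode` is `str` and Python 2's `str` is `bytes`; `round_trip` is a hypothetical helper name, not something from the original code.

```python
def round_trip(text):
    """Apply the chain from the question to a Unicode string.

    round_trip is a hypothetical name used only for this illustration.
    """
    # Step 1: encode to UTF-8, then interpret backslash escapes.
    # unicode-escape treats non-ASCII bytes as Latin-1 characters.
    decoded = text.encode('utf-8').decode('unicode-escape')
    # Step 2: raw_unicode_escape writes code points below 256 as single
    # bytes and everything else as a literal \uXXXX escape.
    return decoded.encode('raw_unicode_escape')


print(round_trip('\u00e4'))    # b'\xc3\xa4'  (the UTF-8 bytes of 'ä')
print(round_trip('\\u0001'))   # b'\x01'
print(round_trip('\\u3030'))   # b'\\u3030'   (still an escape, not UTF-8)
```

So the chain appears to hold up while every decoded code point is below U+0100, but an escape such as \u3030 comes back as the escape text rather than as UTF-8 bytes.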
Upvotes: 0
Views: 6952
Reputation: 55479
I don't understand why (or how) you have a mixture of byte strings and Unicode strings like that. But if that's what your data is like, then you need to process the two types of strings differently.
The code below first prints the representation of each string in data, and the type of object that the string is.
It then calls the decode('unicode-escape') method on the plain byte strings, which converts them to Unicode strings.
Finally, all strings are encoded from Unicode to UTF-8 byte strings.
data = [
    'byte string',
    u'unicode string',
    'this byte string has unicode escapes: \\u2122\\u00e6',
    u'this unicode string has non-ascii chars: ©æ™ä',
]

for s in data:
    print repr(s), type(s)
    # Only plain byte strings need their \uXXXX escapes decoded
    if isinstance(s, str):
        s = s.decode('unicode-escape')
    # s is now a Unicode string, so encode it to UTF-8 bytes
    z = s.encode('utf8')
    print repr(z), z
    print
output
'byte string' <type 'str'>
'byte string' byte string
u'unicode string' <type 'unicode'>
'unicode string' unicode string
'this byte string has unicode escapes: \\u2122\\u00e6' <type 'str'>
'this byte string has unicode escapes: \xe2\x84\xa2\xc3\xa6' this byte string has unicode escapes: ™æ
u'this unicode string has non-ascii chars: \xa9\xe6\u2122\xe4' <type 'unicode'>
'this unicode string has non-ascii chars: \xc2\xa9\xc3\xa6\xe2\x84\xa2\xc3\xa4' this unicode string has non-ascii chars: ©æ™ä
The above output was produced in a terminal that's configured to use UTF-8.
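The code above is Python 2. For anyone on Python 3, here is a rough equivalent, offered as a sketch rather than a drop-in replacement: it assumes the byte strings arrive as `bytes` and the Unicode strings as `str`, and `to_utf8` is a hypothetical helper name.

```python
def to_utf8(s):
    """Return s as UTF-8 encoded bytes.

    to_utf8 is a hypothetical helper name. bytes input is assumed to hold
    ASCII text with literal \\uXXXX escapes; str input is used as-is.
    """
    if isinstance(s, bytes):
        # Turn literal escapes such as \u2122 into real characters
        s = s.decode('unicode-escape')
    # s is a str here, so this always yields UTF-8 bytes
    return s.encode('utf-8')


data = [
    b'byte string',
    'unicode string',
    b'this byte string has unicode escapes: \\u2122\\u00e6',
    'this unicode string has non-ascii chars: \u00a9\u00e6\u2122\u00e4',
]

for s in data:
    print(repr(s), type(s))
    print(repr(to_utf8(s)))
    print()
```

The type check is the important part: running decode('unicode-escape') on text that already contains non-ASCII characters would treat their UTF-8 bytes as Latin-1 and garble them.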
Upvotes: 3
Reputation: 87084
Like this:
>>> s = "\\u0001"
>>> s.decode('unicode-escape')
u'\x01'
>>> s.decode('unicode-escape').encode('utf8')
'\x01'
Here is an example for which it is a bit more obvious that the result is UTF-8 encoded:
>>> s = "\\u3030"
>>> s.decode('unicode-escape')
u'\u3030'
>>> s.decode('unicode-escape').encode('utf8')
'\xe3\x80\xb0'
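Since the goal in the question is to remove characters, the cleaning step can sit between the decode and the encode. The following is only a sketch under assumptions: it is written for Python 3, it guesses that "cleaning" means dropping control characters such as \u0001, and `clean_to_utf8` is a hypothetical name.

```python
import unicodedata


def clean_to_utf8(raw):
    """Decode escapes, drop control characters, return UTF-8 bytes.

    clean_to_utf8 is a hypothetical helper. raw is assumed to be bytes
    holding ASCII text with literal \\uXXXX escapes.
    """
    text = raw.decode('unicode-escape')
    # Category 'Cc' covers control characters; note that this also
    # removes tabs and newlines, which may or may not be wanted.
    kept = ''.join(ch for ch in text if unicodedata.category(ch) != 'Cc')
    return kept.encode('utf-8')


print(clean_to_utf8(b'a\\u0001b\\u3030'))   # b'ab\xe3\x80\xb0'
```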
Upvotes: 1