Reputation: 451
I have input and output text files that can contain non-ASCII characters. Sometimes I need to escape them, and sometimes I need to write the non-ASCII characters themselves. Basically, if I get "Bürgerhaus" I need to output "B\u00FCrgerhaus", and if I get "B\u00FCrgerhaus" I need to output "Bürgerhaus".
One direction goes fine:
>>> s1 = "B\u00FCrgerhaus"
>>> print(s1)
Bürgerhaus
However, in the other direction I do not get the expected result ('B\u00FCrgerhaus'):
>>> s2 = "Bürgerhaus"
>>> s2_trans = s2.encode('utf8').decode('unicode_escape')
>>> print(s2_trans)
Bürgerhaus
I read that unicode-escape needs latin-1, so I tried encoding to that instead, but this did not produce the expected result either. What am I doing wrong?
(PS: Thank you Matthias for reminding me that the conversion in the first example was not necessary.)
Upvotes: 2
Views: 2216
Reputation: 18126
You could do something like this:
charList = []
s1 = "Bürgerhaus"
for i in [ord(x) for x in s1]:
    # Keep ASCII characters; 'encode' other characters as their ordinal in hex
    if i < 128:  # not sure if that is right or can be made easier!
        charList.append(chr(i))
    else:
        charList.append('\\u%04x' % i)
res = ''.join(charList)
print(f"Mixed up string: {res}")
for myStr in (res, s1):
    # Only decode strings that actually contain an escape sequence
    if '\\u' in myStr:
        print(myStr.encode().decode('unicode-escape'))
    else:
        print(myStr)
Out:
Mixed up string: B\u00fcrgerhaus
Bürgerhaus
Bürgerhaus
Explanation:
We convert each character to its corresponding Unicode code point:
print([(c, ord(c)) for c in s1])
[('B', 66), ('ü', 252), ('r', 114), ('g', 103), ('e', 101), ('r', 114), ('h', 104), ('a', 97), ('u', 117), ('s', 115)]
Regular ASCII characters have decimal values < 128; other characters, such as the euro sign or German umlauts, have values >= 128 (detailed table here).
Then we 'encode' every character >= 128 as its corresponding Unicode escape.
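The same idea can be wrapped in two small helpers. This is only a sketch: the names escape_non_ascii and unescape are made up for illustration, the uppercase hex digits follow the format asked for in the question, and the \U branch for code points above U+FFFF is an addition that the loop above does not have.

```python
def escape_non_ascii(text: str) -> str:
    """Replace every non-ASCII character with a \\uXXXX or \\UXXXXXXXX escape."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            parts.append(ch)                # plain ASCII is kept as-is
        elif cp <= 0xFFFF:
            parts.append(f"\\u{cp:04X}")    # four hex digits, uppercase
        else:
            parts.append(f"\\U{cp:08X}")    # eight hex digits above U+FFFF
    return "".join(parts)


def unescape(text: str) -> str:
    """Turn the escapes back into characters.

    Assumes the input is pure ASCII apart from the escape sequences.
    """
    return text.encode("ascii").decode("unicode_escape")


print(escape_non_ascii("Bürgerhaus"))   # B\u00FCrgerhaus
print(unescape("B\\u00FCrgerhaus"))     # Bürgerhaus
```

Note that unescape also interprets other backslash sequences such as \n, and escape_non_ascii does not escape literal backslashes, so the round trip is only safe for text without backslashes of its own.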
Upvotes: 2
Reputation: 764
You can only decode() bytestrings (bytes) to [unicode] strings, and conversely, encode() [unicode] strings to bytes.
So if you want to decode a string escaped with unicode-escape, you need to first convert (encode()) it to a bytestring, e.g., using latin1 as you wrote in the question.
>>> encoded_str = 'B\\xfcrgerhaus'
>>> encoded = encoded_str.encode('latin-1')
>>> encoded
b'B\\xfcrgerhaus'
>>> encoded.decode('unicode-escape')
'Bürgerhaus'
>>> _.encode('unicode-escape')
b'B\\xfcrgerhaus'
>>> _ == encoded
True
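As a script rather than a REPL session, both directions and the attempt from the question might look like the sketch below. The variable names are illustrative only. Two details are worth noting: the built-in codec writes \xfc rather than \u00fc for code points below U+0100, and when decoding it treats unescaped bytes as Latin-1, which is why UTF-8 input comes out garbled.

```python
s = "Bürgerhaus"

# Escaping: str -> bytes through the codec, then back to str as plain ASCII
escaped = s.encode("unicode_escape").decode("ascii")
print(escaped)    # B\xfcrgerhaus

# Unescaping: str -> bytes via latin-1, then let the codec resolve the escapes
restored = escaped.encode("latin-1").decode("unicode_escape")
print(restored)   # Bürgerhaus

# The attempt from the question: the two UTF-8 bytes of "ü" are each read
# as a separate Latin-1 character, so nothing gets escaped
garbled = s.encode("utf8").decode("unicode_escape")
print(garbled)    # BÃ¼rgerhaus
```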
See also: how do I .decode('string-escape') in Python3?
Upvotes: 0