Reputation: 15
I am working with an HTML string in Python that contains non-English characters, which are represented in the string by 16-bit Unicode hex escapes. The string reads:
"Skr\u00E4ddarev\u00E4gen"
The string, when properly converted, should read "Skräddarevägen". How do I ensure that the Unicode escapes are decoded correctly on output so that the accented characters display properly?
(Note: I'm using Requests and Pandas, and the encoding in both is set to utf-8.) Thanks in advance!
Upvotes: 0
Views: 2499
Reputation: 139
In Python 3, when you write the string out to a file, you have to specify the encoding you want in the open() call.
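As a minimal sketch of passing the encoding explicitly, assuming Python 3: the file name street_name.txt and the temporary directory are hypothetical, chosen only for illustration. Without the encoding argument, open() falls back to a platform-dependent default, which may not be UTF-8.

```python
import os
import tempfile

# In Python 3 source code, the \u00E4 escapes are interpreted when the literal is parsed.
text = "Skr\u00E4ddarev\u00E4gen"

# Hypothetical output path, used only for this example.
path = os.path.join(tempfile.gettempdir(), "street_name.txt")

# State the encoding explicitly when writing.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read the file back with the same encoding to confirm the round trip.
with open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()

print(round_tripped)
```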
Upvotes: 4
Reputation: 177911
If you are using Python 3 and that is literally the content of the string, it "just works":
>>> s = "Skr\u00E4ddarev\u00E4gen"
>>> s
'Skräddarevägen'
If the string instead contains literal backslash escapes (as a raw string does), you have to decode it. If it is a Unicode string (str), encode it to bytes first and then decode with unicode_escape; the final result is a Unicode string. If you already have a byte string, skip the encode step.
>>> s = r"Skr\u00E4ddarev\u00E4gen"
>>> s
'Skr\\u00E4ddarev\\u00E4gen'
>>> s.encode('ascii').decode('unicode_escape')
'Skräddarevägen'
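One caveat: encode('ascii') raises UnicodeEncodeError if the string already contains non-ASCII characters alongside the escapes. A possible alternative, sketched here under that assumption, replaces only the \uXXXX sequences with a regular expression. The helper name decode_u_escapes is hypothetical, and this sketch does not combine surrogate pairs or treat a doubled backslash specially.

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def decode_u_escapes(text):
    """Replace literal \\uXXXX sequences with the characters they name."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

raw = r"Skr\u00E4ddarev\u00E4gen"

# A string that mixes an already-decoded character with literal escapes.
mixed = "Malm\u00F6 " + raw

print(decode_u_escapes(raw))
print(decode_u_escapes(mixed))
```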
If you are on Python 2, you'll need to decode, plus print to see it properly:
>>> s = "Skr\u00E4ddarev\u00E4gen"
>>> s
'Skr\\u00E4ddarev\\u00E4gen'
>>> s.decode('unicode_escape')
u'Skr\xe4ddarev\xe4gen'
>>> print s.decode('unicode_escape')
Skräddarevägen
Upvotes: 0
Reputation: 19174
From your display, it is hard to be sure what is in the string. Assuming that it is the 24 characters displayed, I believe the last line of the following answers your question.
s = "Skr\\u00E4ddarev\\u00E4gen"
print(len(s))
for c in s: print(c, end=' ')
print()
print(eval("'"+s+"'"))
print(eval("'"+s+"'").encode('utf-8'))
This prints
24
S k r \ u 0 0 E 4 d d a r e v \ u 0 0 E 4 g e n
Skräddarevägen
b'Skr\xc3\xa4ddarev\xc3\xa4gen'
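Because eval executes arbitrary expressions, it is risky on text that came from a web page. A sketch of the same idea using ast.literal_eval from the standard library, which only accepts Python literals, is shown here. It still assumes that the string contains no single quotes, since it is wrapped in them.

```python
import ast

s = "Skr\\u00E4ddarev\\u00E4gen"  # 24 characters, with literal backslashes

# Wrap the text in quotes and parse it as a string literal only;
# anything other than a literal raises an exception instead of running.
decoded = ast.literal_eval("'" + s + "'")

print(decoded)
print(decoded.encode("utf-8"))
```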
Upvotes: 0