George Mathias

Reputation: 15

Representing non-English characters with Unicode (UTF-8)

I am working with an HTML string in Python that contains non-English characters, which are represented in the string by 16-bit Unicode hex escapes. The string reads:

"Skr\u00E4ddarev\u00E4gen"

When properly converted, the string should read "Skräddarevägen". How do I ensure that the Unicode hex values get correctly encoded/decoded on output and read with the correct accents?

(Note: I'm using Requests and Pandas, and the encoding in both is set to utf-8.) Thanks in advance!

Upvotes: 0

Views: 2499

Answers (3)

朱梅寧

Reputation: 139

In Python 3, there are two cases:

  1. If you pick up your string from an HTML file, you have to read in the HTML file using the correct encoding.
  2. If you have your string in Python 3 code, it is already a Unicode string (a sequence of code points) in memory.

When you write the string out to a file, you have to specify the encoding you want when opening the file.
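As a minimal sketch of both directions, assuming UTF-8 is the encoding you want on disk (the file name `street.txt` and the temporary directory are hypothetical, chosen only for illustration):

```python
import os
import tempfile

s = "Skr\u00E4ddarev\u00E4gen"  # already a Unicode str in Python 3

# Hypothetical file location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "street.txt")

# State the encoding explicitly when writing ...
with open(path, "w", encoding="utf-8") as f:
    f.write(s)

# ... and use the same encoding when reading the text back.
with open(path, encoding="utf-8") as f:
    round_tripped = f.read()

# Reading in binary mode shows what was actually stored:
# each "ä" becomes the two bytes C3 A4 in UTF-8.
with open(path, "rb") as f:
    raw = f.read()

print(round_tripped)
print(raw)
```

Without an explicit `encoding=`, `open` falls back to a platform-dependent default, which is a common source of garbled accents.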

Upvotes: 4

Mark Tolonen

Reputation: 177911

If you are using Python 3 and that is literally the content of the string, it "just works":

>>> s = "Skr\u00E4ddarev\u00E4gen"
>>> s
'Skräddarevägen'

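If you have that string as raw data (that is, it contains literal backslash-u sequences rather than the characters themselves), you have to decode it with the unicode_escape codec. That codec works on bytes, so if you have a Unicode string you'll have to encode it to bytes first. The final result will be Unicode. If you already have a byte string, skip the encode step.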

>>> s = r"Skr\u00E4ddarev\u00E4gen"
>>> s
'Skr\\u00E4ddarev\\u00E4gen'
>>> s.encode('ascii').decode('unicode_escape')
'Skräddarevägen'
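One caveat: `s.encode('ascii')` raises `UnicodeEncodeError` if the text already contains some real non-ASCII characters alongside the escapes. A possible workaround, sketched here under the assumption that only four-hex-digit `\uXXXX` escapes need handling, is to substitute them with a regular expression. The helper name `decode_u_escapes` is hypothetical, and the sketch does not combine surrogate pairs.

```python
import re

# Matches a literal backslash, "u", and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def decode_u_escapes(text):
    """Replace literal \\uXXXX sequences with the characters they name.

    Hypothetical helper for illustration; other text is left untouched.
    Surrogate pairs are not combined.
    """
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# Mixed input: one escaped "ä" and one real "ä".
mixed = "Skr\\u00E4ddarevägen"
print(decode_u_escapes(mixed))
```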

If you are on Python 2, a plain (byte) string literal does not interpret \u escapes, so you'll need to decode, then print to see it properly:

>>> s = "Skr\u00E4ddarev\u00E4gen"
>>> s
'Skr\\u00E4ddarev\\u00E4gen'
>>> s.decode('unicode_escape')
u'Skr\xe4ddarev\xe4gen'
>>> print s.decode('unicode_escape')
Skräddarevägen

Upvotes: 0

Terry Jan Reedy

Reputation: 19174

From your display, it is hard to be sure what is in the string. Assuming that it is the 24 characters displayed, I believe the last line of the following answers your question.

s = "Skr\\u00E4ddarev\\u00E4gen"
print(len(s))
for c in s: print(c, end=' ')
print()
print(eval("'"+s+"'"))
print(eval("'"+s+"'").encode('utf-8'))

This prints

24
S k r \ u 0 0 E 4 d d a r e v \ u 0 0 E 4 g e n 
Skräddarevägen
b'Skr\xc3\xa4ddarev\xc3\xa4gen'
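Since `eval` executes arbitrary code, it is risky on text that came from a web page. If the escapes follow JSON conventions (an assumption; `\uXXXX` sequences inside HTML often come from embedded JSON), a sketch of a safer alternative is to let the `json` module parse the text as a string literal. This assumes the text contains no unescaped double quotes or control characters.

```python
import json

s = "Skr\\u00E4ddarev\\u00E4gen"  # 24 characters, with literal backslashes

# Wrap the text in double quotes so it forms a JSON string literal,
# then let the JSON parser resolve the \uXXXX escapes.
decoded = json.loads('"' + s + '"')

print(len(s))
print(decoded)
print(decoded.encode("utf-8"))
```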

Upvotes: 0
