Getting the unicode characters of a string

Question

I'm getting a string from a qt widget, and I'm trying to convert the non ascii characters (eg. €) into hex unicode characters (eg. x20ac)

Currently, this is what I'm doing to see the unicode character:

unicode_text = self.rich_text_edit.toPlainText() # this string is the € symbol
print("unicode char is: {0}".format(unicode_text))

This provides me with the error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)

That's actually what I want, right there, the 20ac.

How do I get at that?

If I do this:

unicode_text = str(unicode_text).encode('string_escape')
print unicode_text #returns \xe2\x82\xac

It returns 3 characters, all of them wrong. I'm going round in circles :)

I know it's a fairly basic question, but I've never had to worry about unicode before.

Many thanks in advance, Ian

Martin Konecny · Accepted Answer

\xe2\x82\xac is the UTF-8 encoding of the Unicode code point U+20AC (written \u20ac in Python).

Think of it this way: Unicode is a one-to-one mapping between integers and characters, similar to ASCII, except that Unicode covers far more characters.

Your € symbol has an integer value of 8364 (0x20ac in hex), which is far too big to fit into a single 8-bit byte (maximum 255), so UTF-8 encodes it as the 3 bytes \xe2\x82\xac. This is a very high-level overview, but I'd really recommend you take a look at this excellent explanation from Scott Hanselman:

Why the #AskObama Tweet was Garbled on Screen.
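
The relationship between the code point and its UTF-8 bytes can be checked directly. Here is a minimal sketch, assuming Python 3 (where str is already Unicode) rather than the Python 2 used in the question; the variable names are only illustrative:

```python
# Assumption: Python 3, where string literals are Unicode by default.
euro = "\u20ac"  # the € character

code_point = ord(euro)              # integer value of the character
utf8_bytes = euro.encode("utf-8")   # the same character encoded as UTF-8

print(code_point)        # 8364
print(hex(code_point))   # 0x20ac
print(len(utf8_bytes))   # 3
print(utf8_bytes)        # b'\xe2\x82\xac'
```

The code point is one number; the three bytes are just one particular way of storing that number.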

As for your question, you can simply do

>>> print "unicode code point is: {0}".format(hex(ord(unicode_text)))
unicode code point is: 0x20ac
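
Note that ord() only accepts a single character. If the widget text may contain more than one character, something along these lines could convert just the non-ASCII ones. This is a rough sketch assuming Python 3; escape_non_ascii is a made-up helper name, and the "x20ac" output style simply follows the example in the question:

```python
def escape_non_ascii(text):
    """Replace each non-ASCII character with 'x' plus its hex code point."""
    parts = []
    for ch in text:
        if ord(ch) < 128:
            parts.append(ch)  # plain ASCII, keep as-is
        else:
            parts.append("x{:x}".format(ord(ch)))  # e.g. € becomes x20ac
    return "".join(parts)


print(escape_non_ascii("Price: 5\u20ac"))  # Price: 5x20ac
```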
