I'm getting a string from a Qt widget, and I'm trying to convert the non-ASCII characters (e.g. €) into their hex Unicode code points (e.g. 20ac).
What I'm currently doing to see the Unicode character is this:
unicode_text = self.rich_text_edit.toPlainText()  # this string is the € symbol
print("unicode char is: {0}".format(unicode_text))
This provides me with the error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)
That's actually what I want, right there, the 20ac.
How do I get at that?
If I do this:
unicode_text = str(unicode_text).encode('string_escape')
print unicode_text #returns \xe2\x82\xac
It returns 3 characters, all of them wrong. I'm going round in circles :)
I know it's a fairly basic question, but I've never had to worry about Unicode before.
Many thanks in advance, Ian
Upvotes: 1
Views: 683
Reputation: 59671
\xe2\x82\xac is the UTF-8 encoding of the Unicode code point U+20AC (written u'\u20ac' in Python).
Think of it as follows: Unicode is a one-to-one mapping between integers and characters, similar to ASCII, except that Unicode covers a far larger range of integers.
Your € symbol has an integer value of 8364 (or 0x20ac in hex), which is far too big to fit in a single 8-bit byte (a byte can only hold values from 0 to 255), and so UTF-8 represents U+20AC as the 3 individual bytes \xe2\x82\xac. This is a very high-level overview, but I'd really recommend you take a look at this excellent explanation from Scott Hanselman: Why the #AskObama Tweet was Garbled on Screen.
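To make the difference between the code point and its UTF-8 bytes concrete, here is a minimal sketch. The variable names (euro, code_point, utf8_hex) are made up for illustration, and the snippet is written so that it should run on both Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# The euro sign as a one-character Unicode string (illustrative name).
euro = u"\u20ac"

# ord() gives the Unicode code point as an integer.
code_point = ord(euro)

# Encoding to UTF-8 produces bytes; bytearray() lets us iterate over them
# as integers on both Python 2 and Python 3.
utf8_bytes = bytearray(euro.encode("utf-8"))
utf8_hex = [format(b, "02x") for b in utf8_bytes]

print(code_point)               # 8364
print(format(code_point, "x"))  # 20ac
print(utf8_hex)                 # ['e2', '82', 'ac']
```

The single code point 0x20ac and the three bytes e2 82 ac are two views of the same character: the first is the Unicode number, the second is how UTF-8 stores that number.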
As for your question, for a single character you can simply do:
>>> print "unicode code point is: {0}".format(hex(ord(unicode_text)))
unicode code point is: 0x20ac
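Note that ord() only accepts a single character. If the widget text can contain several characters, something along these lines might work. This is only a sketch: hex_code_points is a hypothetical helper name, and it assumes the text is already a Unicode string (on a narrow Python 2 build, characters outside the Basic Multilingual Plane would show up as two surrogate values):

```python
from __future__ import print_function


def hex_code_points(text):
    """Return the hex code point of every non-ASCII character in text.

    Hypothetical helper for illustration; ASCII characters
    (code points 0-127) are skipped.
    """
    return [format(ord(ch), "x") for ch in text if ord(ch) > 127]


sample = u"Price: 5\u20ac, caf\u00e9"
print(hex_code_points(sample))  # ['20ac', 'e9']
```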
Upvotes: 3