Python Unicode Casting on Variable Bug

Question

I've run into some odd Python 2 behavior involving unicode and variables:

>>> u"\u2730".encode('utf-8').encode('hex')
'e29cb0'

This is the result I expect, but I want to control the first part (u"\u2730") dynamically.

>>> type(u"\u2027")
<type 'unicode'>

Good, so the first part is treated as unicode. Now declaring a string variable and converting it to unicode:

>>> a='20'
>>> b='27'
>>> myvar='\u'+a+b.decode('utf-8')
>>> type(myvar)
<type 'unicode'>
>>> print myvar
\u2027

It seems that now I can use the variable in my original code, right?

>>> myvar.encode('utf-8').encode('hex')
'5c7532303237'

The result, as you can see, is not the original one. It seems that Python is treating 'myvar' as a plain string instead of unicode. Am I missing something?

Anyway, my final goal is to loop over the Unicode code points from \u0000 to \uFFFF, convert each one to a string, and convert that string to hex. Is there an easy way?

juanpa.arrivillaga · Accepted Answer

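You are confusing a Unicode escape sequence with the literal characters \ and u. It's like confusing r"\n" (or "\\n") with an actual newline. An escape is only interpreted when a string literal is parsed; concatenating '\u' with digits at runtime just gives you six ordinary characters. You want to decode the str with 'unicode_escape':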

>>> a='20'
>>> b='27'
>>> myvar='\u'+a+b.decode('utf-8')
>>> myvar
u'\\u2027'
>>> myvar.decode('unicode_escape')
u'\u2027'
>>> print(myvar.decode('unicode_escape'))
‧
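
For the loop from \u0000 to \uFFFF, it may be simpler to skip escape text entirely and build each character from its integer code point. The following is only a sketch, not something from the original question: the names to_char, utf8_hex and iter_bmp_hex are made up for illustration, binascii.hexlify stands in for .encode('hex') so the same code also runs on Python 3, and the surrogate range U+D800 to U+DFFF is skipped on the assumption that you only want code points that have a valid UTF-8 encoding.

```python
import binascii

# unichr exists only on Python 2; chr plays the same role on Python 3.
try:
    to_char = unichr
except NameError:
    to_char = chr


def utf8_hex(code_point):
    """Return the UTF-8 bytes of one code point as a lowercase hex string."""
    encoded = to_char(code_point).encode('utf-8')
    return binascii.hexlify(encoded).decode('ascii')


def iter_bmp_hex():
    """Yield (code_point, hex_string) for U+0000 through U+FFFF."""
    for code_point in range(0x0000, 0x10000):
        # Assumption: surrogates are skipped because they are not valid
        # characters on their own and Python 3 refuses to encode them.
        if 0xD800 <= code_point <= 0xDFFF:
            continue
        yield code_point, utf8_hex(code_point)
```

With this, the dynamic case from the question becomes utf8_hex(int(a + b, 16)), where int(..., 16) turns the hex digits '2027' into the number 0x2027, and no '\u' string is ever assembled.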
