Component 10

Reputation: 10477

Get UTF-8 character codes from Python unicode string

I'm reading a string from the command line which I know is Korean text encoded as UTF-8. I get the string by running a command like so:

<my_command> | od -t x1

which gives me:

0000000 ec a7 80 ec 97 ad 2f ea b5 ad ea b0 80 0a
0000016

The six UTF-8 encoded characters are {eca780}{ec97ad}{2f}{eab5ad}{eab080}{0a}. I then read the string in Python using:

utf8_str = unicode(text_from_the_cl, encoding='utf-8')

I just want to see the string I've read in terms of the UTF-8 codes for its characters. Something like \uc9c0\uc5ed/\uad6d\uac00 would be good. This is only to check that the characters are being read in properly.

(I should point out also that this is Python 2.6.x - over which I have no control)

Upvotes: 2

Views: 3852

Answers (2)

Martijn Pieters

Reputation: 1121316

If you want to verify the contents of your unicode string, use the repr() function:

>>> from binascii import unhexlify
>>> unhexlify(''.join('ec a7 80 ec 97 ad 2f ea b5 ad ea b0 80 0a'.split()))
'\xec\xa7\x80\xec\x97\xad/\xea\xb5\xad\xea\xb0\x80\n'
>>> print unhexlify(''.join('ec a7 80 ec 97 ad 2f ea b5 ad ea b0 80 0a'.split())).decode('utf8')
지역/국가

>>> print repr(unhexlify(''.join('ec a7 80 ec 97 ad 2f ea b5 ad ea b0 80 0a'.split())).decode('utf8'))
u'\uc9c0\uc5ed/\uad6d\uac00\n'

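In Python 2, the repr() result for a unicode value uses \uhhhh escape sequences for code points above the Latin-1 range (U+0100 and up). Latin-1 code points outside ASCII and non-printable characters use \xhh escape sequences instead, apart from a few such as the newline, which appears as \n in the output above.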
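If you would rather not depend on how repr() chooses its escapes, you can build the escaped form yourself from ord(). The following is only a minimal sketch: escape_code_points is a hypothetical helper name, not part of any library, and the byte string is the one from the od output in the question. It should behave the same way on Python 2.6 and on Python 3.

```python
# -*- coding: utf-8 -*-

def escape_code_points(text):
    """Return text with every non-printable or non-ASCII character
    written as a \\uhhhh escape (hypothetical helper, for illustration)."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp < 0x7f:
            # Printable ASCII is kept as-is.
            parts.append(ch)
        else:
            # Everything else is shown as its Unicode code point in hex.
            parts.append('\\u%04x' % cp)
    return ''.join(parts)

# The bytes reported by od in the question.
raw = b'\xec\xa7\x80\xec\x97\xad/\xea\xb5\xad\xea\xb0\x80\n'
text = raw.decode('utf-8')

print(escape_code_points(text))  # \uc9c0\uc5ed/\uad6d\uac00\u000a
```

Unlike repr(), this sketch writes the trailing newline as \u000a rather than \n, so every character outside printable ASCII is shown in the same form.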

Upvotes: 1

chepner

Reputation: 530892

Use the encode method:

utf8_str.encode('utf8')

Note that utf8_str isn't a great name for the variable. The original byte sequence uses UTF-8 encoding to represent the Unicode characters; the call to unicode "decodes" those bytes into the actual Unicode code points. To get the bytes back, you just re-encode the code points to UTF-8.
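As a rough sketch of that round trip, the snippet below decodes the bytes from the question, re-encodes them, and prints them in the same space-separated hex form that od -t x1 uses. The variable names (raw, text, encoded, hex_bytes) are illustrative only, and the code is written so that it should run on both Python 2.6 and Python 3.

```python
# The bytes reported by od in the question.
raw = b'\xec\xa7\x80\xec\x97\xad/\xea\xb5\xad\xea\xb0\x80\n'

# Decode: UTF-8 bytes -> Unicode code points (6 characters).
text = raw.decode('utf-8')

# Encode: Unicode code points -> UTF-8 bytes again (14 bytes).
encoded = text.encode('utf-8')

# bytearray yields integers on both Python 2.6 and Python 3,
# which makes the hex formatting portable.
hex_bytes = ' '.join('%02x' % b for b in bytearray(encoded))

print(hex_bytes)  # ec a7 80 ec 97 ad 2f ea b5 ad ea b0 80 0a
```

If the printed bytes match the od output, the string was decoded correctly.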

Upvotes: 1
