altunyurt

Reputation: 2936

Getting bytes from a Unicode string in Python

I have a 16-bit big-endian Unicode string represented as u'\u4132'.

How can I split it into the integers 41 and 32 in Python?

Upvotes: 12

Views: 57842

Answers (6)

Danilo Souza Morães

Reputation: 1593

Pass the Unicode character to ord() to get its code point, break that code point into individual bytes with int.to_bytes(), and then format the output however you want:

list(map(lambda b: hex(b)[2:], ord('\u4132').to_bytes(4, 'big')))

returns: ['0', '0', '41', '32']

list(map(lambda b: hex(b)[2:], ord('\N{PILE OF POO}').to_bytes(4, 'big')))

returns: ['0', '1', 'f4', 'a9']

As I mentioned in another comment, encoding to UTF-16 will not return the bytes of the code point itself for code points outside the BMP (Basic Multilingual Plane), because UTF-16 needs a surrogate pair to encode those code points.
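To illustrate the surrogate-pair point, here is a minimal sketch, assuming Python 3. The variable names are only illustrative and not part of the answer above:

```python
# Python 3 sketch: compare UTF-16BE bytes with the raw code point bytes.
bmp = '\u4132'               # a character inside the BMP
astral = '\N{PILE OF POO}'   # U+1F4A9, outside the BMP

bmp_utf16 = list(bmp.encode('utf-16be'))         # one 16-bit code unit
astral_utf16 = list(astral.encode('utf-16be'))   # surrogate pair: two code units
astral_codepoint = list(ord(astral).to_bytes(4, 'big'))

print([hex(b) for b in bmp_utf16])         # ['0x41', '0x32']
print([hex(b) for b in astral_utf16])      # ['0xd8', '0x3d', '0xdc', '0xa9']
print([hex(b) for b in astral_codepoint])  # ['0x0', '0x1', '0xf4', '0xa9']
```

For the BMP character both approaches agree on the bytes 41 and 32; for U+1F4A9 the UTF-16BE bytes are the surrogates D83D DCA9 rather than the code point.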

Upvotes: 0

Roland Illig

Reputation: 41625

  • Java: "\u4132".getBytes("UTF-16BE")
  • Python 2: u'\u4132'.encode('utf-16be')
  • Python 3: '\u4132'.encode('utf-16be')

These methods return a byte array, which you can easily convert to an int array. Note that code points above U+FFFF are encoded using two code units (with UTF-16BE, that means 32 bits or 4 bytes).
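A minimal sketch of the byte-array-to-int-array step, assuming Python 3 (the names data, ints and hex_strings are made up for this example):

```python
# Python 3 sketch: turn the UTF-16BE bytes into a list of integers.
data = '\u4132'.encode('utf-16be')         # bytes object, here b'A2'
ints = list(data)                          # iterating bytes yields ints in Python 3
hex_strings = ['%02x' % b for b in ints]   # two-digit hex for display

print(ints)         # [65, 50]
print(hex_strings)  # ['41', '32']
```

In Python 2, iterating the encoded str yields one-character strings instead, so something like list(bytearray(data)) or ord(c) per character would be needed there.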

Upvotes: 4

Chris Morgan

Reputation: 90752

Here are several ways to get it, depending on the form you want.

Python 2:

>>> chars = u'\u4132'.encode('utf-16be')
>>> chars
'A2'
>>> ord(chars[0])
65
>>> '%x' % ord(chars[0])
'41'
>>> hex(ord(chars[0]))
'0x41'
>>> ['%x' % ord(c) for c in chars]
['41', '32']
>>> [hex(ord(c)) for c in chars]
['0x41', '0x32']

Python 3:

>>> chars = '\u4132'.encode('utf-16be')
>>> chars
b'A2'
>>> chars = bytes('\u4132', 'utf-16be')
>>> chars  # Just the same.
b'A2'
>>> chars[0]
65
>>> '%x' % chars[0]
'41'
>>> hex(chars[0])
'0x41'
>>> ['%x' % c for c in chars]
['41', '32']
>>> [hex(c) for c in chars]
['0x41', '0x32']

Upvotes: 19

jfs

Reputation: 414215

>>> c = u'\u4132'
>>> '%x' % ord(c)
'4132'
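If two separate integers are wanted rather than one hex string, the code point can be split arithmetically. This is a sketch assuming Python 3 and a code point no larger than 0xFFFF; the names high and low are only illustrative:

```python
# Python 3 sketch: split a BMP code point into its high and low byte.
c = '\u4132'
code_point = ord(c)                     # 0x4132
high, low = divmod(code_point, 0x100)   # same as (code_point >> 8, code_point & 0xFF)

print('%x %x' % (high, low))            # 41 32
```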

Upvotes: 2

seriyPS

Reputation: 7102

Dirty hack (Python 2): repr(u'\u4132') returns "u'\\u4132'".

Upvotes: 1

Ivo Wetzel

Reputation: 46745

"Those" aren't integers; it is a single hexadecimal number that represents the code point.

If you want an integer representation of the code point, use ord(u'\u4132'). If you then want to convert that integer back to the Unicode character, use unichr() (chr() in Python 3), which returns a Unicode string.
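A short round-trip sketch, assuming Python 3, where chr() maps an integer code point back to a one-character string (in Python 2 the corresponding built-in is unichr()):

```python
# Python 3 sketch: code point to integer and back again.
code_point = ord('\u4132')   # integer value of the code point
char = chr(code_point)       # back to a one-character string

print(code_point, hex(code_point), char == '\u4132')  # 16690 0x4132 True
```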

Upvotes: 2
