Reputation: 2936
I have a 16-bit big-endian Unicode string represented as u'\u4132'. How can I split it into the integers 41 and 32 in Python?
Upvotes: 12
Views: 57842
Reputation: 1593
Pass the Unicode character to ord() to get its code point, break that code point into individual bytes with int.to_bytes(), and then format the output however you want:
list(map(lambda b: hex(b)[2:], ord('\u4132').to_bytes(4, 'big')))
returns: ['0', '0', '41', '32']
list(map(lambda b: hex(b)[2:], ord('\N{PILE OF POO}').to_bytes(4, 'big')))
returns: ['0', '1', 'f4', 'a9']
As I mentioned in another comment, encoding the character as UTF-16 will not work as expected for code points outside the BMP (Basic Multilingual Plane), since UTF-16 needs a surrogate pair to encode those code points.
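To make the difference concrete, here is a minimal Python 3 sketch that compares the two approaches for a BMP character and for a character outside the BMP. The helper names code_point_bytes and utf16be_bytes are made up for this illustration; bytes.hex() assumes Python 3.5 or later.

```python
# Python 3 sketch: raw code point bytes vs. UTF-16BE encoded bytes.
# The helper names are hypothetical, chosen only for this example.

def code_point_bytes(ch):
    """Return the code point of a single character as 4 big-endian bytes."""
    return ord(ch).to_bytes(4, 'big')

def utf16be_bytes(ch):
    """Return the UTF-16BE encoding of a single character."""
    return ch.encode('utf-16be')

bmp = '\u4132'
astral = '\N{PILE OF POO}'  # U+1F4A9, outside the BMP

print(code_point_bytes(bmp).hex())     # 00004132
print(utf16be_bytes(bmp).hex())        # 4132
print(code_point_bytes(astral).hex())  # 0001f4a9
print(utf16be_bytes(astral).hex())     # d83ddca9 (a surrogate pair)
```

For BMP characters the two results agree apart from the leading zero bytes; for U+1F4A9 the UTF-16BE bytes are the surrogate pair D83D DCA9 rather than the code point itself.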
Upvotes: 0
Reputation: 41625
"\u4132".getBytes("UTF-16BE")
u'\u4132'.encode('utf-16be')
'\u4132'.encode('utf-16be')
These methods (Java, Python 2 and Python 3, respectively) return a byte array, which you can easily convert to an int array. Note, however, that code points above U+FFFF will be encoded using two code units (so with UTF-16BE this means 32 bits, or 4 bytes).
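As a rough Python 3 sketch of that conversion (the function names utf16be_ints and utf16_code_units are invented for this example), iterating over a bytes object already yields integers, and pairing the bytes up gives the 16-bit code units:

```python
# Python 3 sketch; function names are hypothetical.

def utf16be_ints(text):
    """Encode text as UTF-16BE and return the bytes as a list of ints."""
    return list(text.encode('utf-16be'))

def utf16_code_units(text):
    """Return the 16-bit UTF-16 code units of text as a list of ints."""
    data = text.encode('utf-16be')
    return [int.from_bytes(data[i:i + 2], 'big') for i in range(0, len(data), 2)]

print(utf16be_ints('\u4132'))                 # [65, 50], i.e. 0x41 and 0x32
print(len(utf16be_ints('\N{PILE OF POO}')))   # 4
print([hex(u) for u in utf16_code_units('\N{PILE OF POO}')])  # ['0xd83d', '0xdca9']
```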
Upvotes: 4
Reputation: 90752
Here are several ways to do it, depending on the output format you want.
Python 2:
>>> chars = u'\u4132'.encode('utf-16be')
>>> chars
'A2'
>>> ord(chars[0])
65
>>> '%x' % ord(chars[0])
'41'
>>> hex(ord(chars[0]))
'0x41'
>>> ['%x' % ord(c) for c in chars]
['41', '32']
>>> [hex(ord(c)) for c in chars]
['0x41', '0x32']
Python 3:
>>> chars = '\u4132'.encode('utf-16be')
>>> chars
b'A2'
>>> chars = bytes('\u4132', 'utf-16be')
>>> chars # Just the same.
b'A2'
>>> chars[0]
65
>>> '%x' % chars[0]
'41'
>>> hex(chars[0])
'0x41'
>>> ['%x' % c for c in chars]
['41', '32']
>>> [hex(c) for c in chars]
['0x41', '0x32']
Upvotes: 19
Reputation: 46745
"Those" aren't integers, it's a hexadecimal number which represents the code point.
If you want the integer representation of the code point, use ord(u'\u4132').
To convert that integer back to the Unicode character, use unichr() (chr() in Python 3), which returns a Unicode string.
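In Python 3, where chr() turns an integer code point back into a one-character string, the round trip and the split into two bytes could be sketched like this. The helper split_code_point is a hypothetical name, and it assumes the character lies in the BMP so that its code point fits in 16 bits.

```python
# Python 3 sketch; split_code_point is a made-up helper name.

def split_code_point(ch):
    """Split a BMP character's code point into (high byte, low byte)."""
    code_point = ord(ch)
    if code_point > 0xFFFF:
        # Assumption: only 16-bit code points are handled here.
        raise ValueError('code point does not fit in 16 bits')
    return divmod(code_point, 0x100)

code_point = ord('\u4132')
print(code_point)                    # 16690
print(hex(code_point))               # 0x4132
print(chr(code_point) == '\u4132')   # True

high, low = split_code_point('\u4132')
print('%x %x' % (high, low))         # 41 32
```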
Upvotes: 2