Reputation: 2222
Let us use the character Latin Capital Letter A with Ogonek (U+0104) as an example.
I have an int that represents its UTF-8 encoded form:
my_int = 0xC484
# Decimal: `50308`
# Binary: `0b1100010010000100`
If I use the unichr
function, I get: \uC484
or 쒄
(U+C484)
But, I need it to output: Ą
How do I convert my_int
to a Unicode code point?
Upvotes: 9
Views: 9697
Reputation: 536399
Encode the number to a hex string, using hex()
or %x
. Then you can interpret that as a series of hex bytes using the hex
decoder. Finally, use the utf-8
decoder to get a unicode string:
def weird_utf8_integer_to_unicode(n):
    s = '%x' % n
    if len(s) % 2:
        s = '0' + s
    return s.decode('hex').decode('utf-8')
The len
check is in case the first byte is in the range 0x1–0xF, which would leave the hex string missing a leading zero. This should cope with a string of any length and any character. However, encoding a byte sequence in an integer like this cannot preserve leading zero bytes.
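The function above relies on the Python 2 'hex' codec, which str objects no longer offer in Python 3. A minimal sketch of the same idea for Python 3, assuming bytes.fromhex as the replacement for the hex decoder (the name utf8_int_to_str is made up for illustration):

```python
def utf8_int_to_str(n):
    # Format the integer as hex digits, then pad to an even number of
    # digits so that every byte is written with exactly two of them.
    s = '%x' % n
    if len(s) % 2:
        s = '0' + s
    # bytes.fromhex stands in for Python 2's s.decode('hex').
    return bytes.fromhex(s).decode('utf-8')


print(utf8_int_to_str(0xC484))
```

As with the original, leading zero bytes cannot be recovered from the integer.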
Upvotes: 1
Reputation: 414285
>>> int2bytes(0xC484).decode('utf-8')
u'\u0104'
>>> print(_)
Ą
where int2bytes()
is defined here.
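The linked definition is not reproduced in this thread. As a hypothetical stand-in only (not necessarily what the link contains), a Python 3 sketch of such a helper could be built on int.to_bytes, assuming big-endian byte order and a non-negative input:

```python
def int2bytes(n):
    # Hypothetical stand-in for the linked helper: the smallest number of
    # bytes that can hold n, with at least one byte so that 0 maps to b'\x00'.
    length = (n.bit_length() + 7) // 8 or 1
    # Big-endian keeps the bytes in the order they appear in the hex literal.
    return n.to_bytes(length, 'big')


print(int2bytes(0xC484).decode('utf-8'))
```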
Upvotes: 1
Reputation: 59148
To convert the integer 0xC484
to the bytestring '\xc4\x84'
(the UTF-8 representation of the Unicode character Ą
), you can use struct.pack()
:
>>> import struct
>>> struct.pack(">H", 0xC484)
'\xc4\x84'
... where >
in the format string represents big-endian, and H
represents unsigned short int.
Once you have your UTF-8 bytestring, you can decode it to Unicode as usual:
>>> struct.pack(">H", 0xC484).decode("utf8")
u'\u0104'
>>> print struct.pack(">H", 0xC484).decode("utf8")
Ą
Upvotes: 3
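One caveat with the ">H" format is that it always produces exactly two bytes, so it only suits characters whose UTF-8 form is two bytes long. A small Python 3 sketch of that behaviour (the function name is made up for illustration):

```python
import struct


def two_byte_utf8_int_to_str(n):
    # ">H" packs n as a big-endian unsigned 16-bit value: always two bytes.
    return struct.pack(">H", n).decode("utf-8")


print(two_byte_utf8_int_to_str(0xC484))

# A one-byte character such as "A" (0x41) comes back with a leading NUL,
# because the value is padded out to two bytes.
print(repr(two_byte_utf8_int_to_str(0x41)))

# A three-byte sequence such as 0xE282AC does not fit in an unsigned short,
# so struct.pack raises struct.error.
try:
    two_byte_utf8_int_to_str(0xE282AC)
except struct.error as exc:
    print("struct.error:", exc)
```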