Reputation: 753
I want to convert a number of Unicode code points read from a file to their UTF-8 encoding.
For example, I want to convert the string 'FD9B' to the string 'EFB69B'.
I can do this manually using string literals like this:
u'\uFD9B'.encode('utf-8')
but I cannot work out how to do it programmatically.
Upvotes: 16
Views: 38028
Reputation: 399813
Use the built-in function chr() to convert the number to a character, then encode that:
>>> chr(int('fd9b', 16)).encode('utf-8')
b'\xef\xb6\x9b'
This is the encoded byte string itself. If you want it as ASCII hex, call .hex() on the result (Python 3.5+), or, on Python 2, walk through it and convert each character c to hex using hex(ord(c)) or similar.
Note: If you are still stuck with Python 2, you can use unichr()
instead.
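On Python 3.5 or later, the whole conversion from the question can be sketched as one small function. The name `codepoint_hex_to_utf8_hex` is made up for illustration; `bytes.hex()` handles the per-byte hex step.

```python
def codepoint_hex_to_utf8_hex(codepoint_hex):
    """Return the UTF-8 encoding of one hex code point as an uppercase hex string."""
    # Parse the hex digits into an integer code point, e.g. 'FD9B' -> 0xFD9B.
    codepoint = int(codepoint_hex, 16)
    # chr() maps the code point to a one-character str; encode() yields bytes.
    utf8_bytes = chr(codepoint).encode('utf-8')
    # bytes.hex() gives two lowercase hex digits per byte.
    return utf8_bytes.hex().upper()


print(codepoint_hex_to_utf8_hex('FD9B'))  # EFB69B
```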
Upvotes: 24
Reputation: 5728
unichr() can raise an error for code points above U+FFFF on narrow Python 2 builds:
>>> n = int('0001f600', 16)
>>> unichr(n)
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
Here is another approach for code points above U+FFFF on narrow Python builds:
>>> n = int('0001f600', 16)
>>> s = '\\U{:0>8X}'.format(n)
>>> s = s.decode('unicode-escape')
>>> s.encode("utf-8")
'\xf0\x9f\x98\x80'
And using the original question's value:
>>> n = int('FD9B', 16)
>>> s = '\\u{:0>4X}'.format(n)
>>> s = s.decode('unicode-escape')
>>> s.encode("utf-8")
'\xef\xb6\x9b'
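For comparison, a sketch assuming Python 3.3 or later, where the narrow/wide build distinction no longer exists and chr() accepts any code point up to U+10FFFF, so the escape-string workaround should not be needed:

```python
# Assumes Python 3.3+; chr() covers the full range 0..0x10FFFF there.
n = int('0001f600', 16)
utf8_bytes = chr(n).encode('utf-8')
print(utf8_bytes)  # b'\xf0\x9f\x98\x80'
```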
Upvotes: 0
Reputation: 16300
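Here's a complete solution (Python 2). Each byte is zero-padded to two hex digits so that bytes below 0x10 are not shortened:
>>> ''.join(['{0:02x}'.format(ord(x)) for x in unichr(int('FD9B', 16)).encode('utf-8')]).upper()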
'EFB69B'
Upvotes: 4
Reputation: 95921
If the input string length is a multiple of 4 (i.e. your Unicode code points are UCS-2 encoded), then try this (Python 2):
import struct

def unihex2utf8hex(arg):
    # Each code point takes 4 hex digits (one 16-bit unit).
    count = len(arg) // 4
    # Unpack the raw bytes as big-endian unsigned 16-bit integers.
    uniarr = struct.unpack('!%dH' % count, arg.decode('hex'))
    return u''.join(map(unichr, uniarr)).encode('utf-8').encode('hex')
>>> unihex2utf8hex('fd9b')
'efb69b'
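A rough Python 3 counterpart, assuming the same input (a hex string of big-endian 16-bit units). The function name `unihex_to_utf8hex` is hypothetical, and decoding as UTF-16-BE is a swapped-in technique rather than the struct approach; as a side effect it also combines surrogate pairs.

```python
def unihex_to_utf8hex(arg):
    """Convert hex of big-endian 16-bit code units to hex of the UTF-8 bytes."""
    # bytes.fromhex() turns 'fd9b' into b'\xfd\x9b'.
    raw = bytes.fromhex(arg)
    # Interpret the bytes as big-endian UTF-16, then re-encode as UTF-8.
    return raw.decode('utf-16-be').encode('utf-8').hex()


print(unihex_to_utf8hex('fd9b'))      # efb69b
print(unihex_to_utf8hex('fd9b0041'))  # efb69b41
```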
Upvotes: 1
Reputation: 31718
data_from_file='\uFD9B'
unicode(data_from_file,"unicode_escape").encode("utf8")
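The snippet above relies on Python 2's unicode(). A minimal Python 3 sketch of the same idea, assuming the file contains the six ASCII characters \uFD9B (written here as a raw string):

```python
# Assumption: the text read from the file is the literal escape sequence,
# i.e. a backslash, 'u', and four hex digits.
data_from_file = r'\uFD9B'
# unicode_escape decodes bytes, so go through ASCII bytes first.
char = data_from_file.encode('ascii').decode('unicode_escape')
print(char.encode('utf-8'))  # b'\xef\xb6\x9b'
```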
Upvotes: 3
Reputation: 2654
Python 2.6.2 (r262:71600, Apr 16 2009, 09:17:39)
[GCC 4.0.1 (Apple Computer, Inc. build 5250)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> u'\uFD9B'.encode('utf-8')
'\xef\xb6\x9b'
>>> s = 'FD9B'
>>> i = int(s, 16)
>>> i
64923
>>> unichr(i)
u'\ufd9b'
>>> _.encode('utf-8')
'\xef\xb6\x9b'
Upvotes: 2