Reputation: 1015
I have a Unicode code point, e.g. "00C4", stored in an array. I want to replace a placeholder, e.g. "\A25", in a text with the character for that code point; the array holds only the hexadecimal value. I have tried encoding, decoding, raw strings, unicode strings, and various setups with the escape character "\". The issue is that I cannot write the literal '\u1234' in the code; I have to take the value from the array and combine it with something like '\u'. This is my current code:
e.g. prototypeArray[i][1] = 00C4
e.g. prototypeArray[i][0] = A25
unicodeChar = u'\\u' + prototypeArray[i][1]
placeholder = '\\' + prototypeArray[i][0]
placeholder = u'' + placeholder
text = text.replace(placeholder, unicodeChar)
Currently this only replaces e.g. \A25 with the literal text \u00C4. The escape sequence is not interpreted as a Unicode character.
Upvotes: 1
Views: 1156
Reputation: 824
UTF-8 specific interpretation: I assume you have the UTF-8 encoding of a Unicode code point, written in hexadecimal and stored as a string in a variable (c), and you want to determine the corresponding character. The following code snippet shows how to do it:
>>> import binascii
>>> cp2chr = lambda c: binascii.unhexlify(c.zfill(len(c) + (len(c) & 1))).decode('utf-8')
>>> cp2chr('C484')
'Ą'
Explanation: zfill prepends a zero if the number of characters is odd. binascii.unhexlify takes the characters two at a time, interprets each pair as a hexadecimal number, and turns it into one byte. All those bytes are combined into a bytes object. Finally, bytes.decode('utf-8') interprets those bytes as UTF-8 encoded data and returns the result as a string.
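The three steps can also be written out one at a time, which makes each intermediate value easy to inspect. This is only a sketch of the same idea; the function name utf8_hex_to_str is made up for illustration and is not part of the original snippet.

```python
import binascii

def utf8_hex_to_str(hex_digits):
    """Decode a string of hex digits that holds UTF-8 bytes (hypothetical helper)."""
    # Step 1: pad to an even number of digits, so every byte has two hex characters.
    padded = hex_digits.zfill(len(hex_digits) + (len(hex_digits) & 1))
    # Step 2: turn each pair of hex digits into one byte.
    raw = binascii.unhexlify(padded)
    # Step 3: interpret the bytes as UTF-8 encoded text.
    return raw.decode('utf-8')

decoded = utf8_hex_to_str('C484')
print(decoded)
```

With an odd number of digits, such as 'A', the padding step produces '0A', which decodes to a newline character.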
>>> cp2chr('00C4')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <lambda>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 1: unexpected end of data
Your provided example, however, is not valid UTF-8 data. See Wikipedia's UTF-8 byte structure table to identify valid byte sequences. C4 has the bit structure 11000100. The prefix 110 marks it as the leading byte of a two-byte sequence, so it requires a continuation byte (10xxxxxx) afterwards, and none follows.
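To check where a byte fits in the UTF-8 byte structure table, you can look at its leading bits. The following is a minimal sketch; the helper name utf8_byte_kind and the returned labels are my own and only serve as illustration.

```python
def utf8_byte_kind(byte):
    """Classify a single byte value (0..255) by its UTF-8 bit prefix."""
    if (byte & 0b10000000) == 0b00000000:
        return 'single byte (ASCII)'      # 0xxxxxxx
    if (byte & 0b11000000) == 0b10000000:
        return 'continuation byte'        # 10xxxxxx
    if (byte & 0b11100000) == 0b11000000:
        return 'lead byte of 2'           # 110xxxxx
    if (byte & 0b11110000) == 0b11100000:
        return 'lead byte of 3'           # 1110xxxx
    if (byte & 0b11111000) == 0b11110000:
        return 'lead byte of 4'           # 11110xxx
    return 'invalid in UTF-8'

# The bytes of the two examples: 00 C4 (fails) and C4 84 (decodes fine).
for value in (0x00, 0xC4, 0x84):
    print(format(value, '08b'), utf8_byte_kind(value))
```

In 00 C4 the byte C4 opens a two-byte sequence at the very end of the data, which matches the "unexpected end of data" message. In C4 84 it is followed by the continuation byte 84, so the decode succeeds.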
Encoding independent interpretation:
So you are probably looking for an interpretation of Unicode code points that is independent of any byte encoding. In that case you want the raw_unicode_escape codec:
>>> cp2chr = lambda c: (b'\\u' + c.encode('ascii')).decode('raw_unicode_escape')
>>> cp2chr('00C4')
'Ä'
Explanation: decoding with raw_unicode_escape converts the \uXXXX escape sequences in a byte string into the corresponding characters and returns the result as a string: b'\\u00C4'.decode('raw_unicode_escape') gives Ä. This is comparable to what Python does when you write \uSOMETHING in a string literal in your source code.
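Applied to the placeholder replacement from the question, this could look roughly as follows. It is a sketch under assumptions: the sample text and the second array entry are invented, the function names are hypothetical, and every code point is assumed to be given as exactly four hex digits, because a \u escape takes exactly four.

```python
def code_point_to_char(hex_code):
    """Turn a 4-digit hex code point such as '00C4' into its character."""
    # Build the byte string b'\\u00C4' and let the codec interpret the escape.
    return (b'\\u' + hex_code.encode('ascii')).decode('raw_unicode_escape')

def fill_placeholders(text, pairs):
    """Replace each backslash placeholder (e.g. \\A25) with its character."""
    for name, hex_code in pairs:
        text = text.replace('\\' + name, code_point_to_char(hex_code))
    return text

# Same layout as in the question: [placeholder name, code point in hex].
prototype_array = [['A25', '00C4'], ['A26', '00D6']]
result = fill_placeholders('\\A25pfel und \\A26l', prototype_array)
print(result)
```

If the values are always plain hexadecimal code points, chr(int(hex_code, 16)) gives the same character without going through a codec, and it also works for code points that need more than four hex digits.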
Upvotes: 2