Jannik
Jannik

Reputation: 1015

Convert single Unicode character to ASCII character

I got a unicode e.g. "00C4" saved in an array. I want to replace a placeholder e.g. "\A25" in a text with the ascii value of an unicode from the array which only has the unicode value. I tried everything from encoding, decoding, raw strings, unicode strings and different setups with the escape symbol "\". The issue here is that i can not write the clear '\u1234' in the code, I have to use the array values and combine it with something like '\u'. This is my current code:

e.g. prototypeArray[i][1] = 00C4

e.g. prototypeArray[i][0] = A25

unicodeChar = u'\\u' + prototypeArray[i][1]
placeholder = '\\' + prototypeArray[i][0]
placeholder = u'' + placeholder
text = text.replace(placeholder,s)

Currently it is only replacing e.g. \A25 with \u00C4 in the text. The unicode character is not interpreted as such.

Upvotes: 1

Views: 1156

Answers (1)

meisterluk
meisterluk

Reputation: 824

UTF-8 specific interpretation: I assume you have the unicode point represented in hexadecimal in UTF-8 stored as a string in a variable (c). And you want to determine the corresponding character. Then the following code snippet shows how to do it:

>>> import binascii
>>> cp2chr = lambda c: binascii.unhexlify(c.zfill(len(c) + (len(c) & 1))).decode('utf-8')
>>> cp2chr('C484')
'Ą'

Explanation: zfill prepends a zero if the number of characters is odd. binascii.unhexlify basically takes two characters each, interprets them as hexadecimal numbers and make them one byte. All those bytes are merged to a bytes array. Finally str.decode('utf-8') interprets those bytes as UTF-8 encoded data and returns it as string.

>>> cp2chr('00C4')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <lambda>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 1: unexpected end of data

Your provided example, however, is not valid UTF-8 data. See Wikipedia's UTF-8 byte structure table to identify valid byte sequences. C4 has bit structure 11000100, is therefore a continuation byte and requires another character afterwards.

Encoding independent interpretation: So you might be looking for interpretation of unicode points independent of the encoding. Then you are looking for the raw_unicode_escape encoding:

>>> cp2chr = lambda c: (b'\\u' + c.encode('ascii')).decode('raw_unicode_escape') 
>>> cp2chr('00C4')
'Ä'

Explanation: raw_unicode_escape convert the unicode escape sequences given in a byte string and returns it as string: b'\\u00C4'.decode('raw_unicode_escape') gives Ä. This is what python does internally if you write \uSOMETHING in your source code.

Upvotes: 2

Related Questions