Kyuwon Cho

Reputation: 45

How can I convert a Unicode string containing characters that cannot be encoded as UTF-8 or UTF-16 to binary or hex in Python?

I have a Unicode string which contains some characters that can't be encoded as UTF-8 or UTF-16, such as \ud875. I want to write this string to a file. What can I do?

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-16' codec can't encode character '\ud875' in position 70: surrogates not allowed

This error occurs when I try to write the string to a file.

Upvotes: 0

Views: 1682

Answers (1)

jsbueno

Reputation: 110311

The answer is: you don't! There is no way to do that and end up with a byte sequence that, by itself, represents the original text.

The fact is that if a Unicode character has no representation in UTF-8 or UTF-16, it can't be encoded as such, end of story.

If one ends up with arbitrary data inside a text string and has to store it as bytes, one can use one of the "charmap" codecs in which every character in the 0-255 range has a representation (latin-1, for example). Text limited to that range can then round-trip to bytes and back to text (but in that case you should just use bytes to begin with).
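As a minimal sketch of that round trip, assuming latin-1 as the codec (the answer does not name one here), the following shows that all 256 byte values survive, while a code point above the 0-255 range, such as the surrogate from the question, is still rejected:

```python
# Every code point in the 0-255 range maps to exactly one byte in latin-1,
# so text limited to that range survives a trip to bytes and back.
raw = bytes(range(256))              # arbitrary binary data
as_text = raw.decode("latin-1")      # bytes -> str, one character per byte
round_tripped = as_text.encode("latin-1")

print(round_tripped == raw)          # the original bytes come back unchanged

# Anything above U+00FF has no latin-1 representation and is rejected.
try:
    "\ud875".encode("latin-1")
    latin1_accepts_surrogate = True
except UnicodeEncodeError:
    latin1_accepts_surrogate = False

print(latin1_accepts_surrogate)
```

So this approach only helps when the string really is byte data in disguise; it does nothing for the surrogate in the question.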

If you have arbitrary higher code points that are not real characters, normally you can't encode them. The UTF-8 and UTF-16 definitions would, in principle, allow arbitrary code points to be encoded, since the specs describe those encodings as bit-field mappings back to the code point value. However, the special "surrogate" code points (U+D800 to U+DFFF), which are exactly the ones UTF-16 uses in pairs to represent characters outside the Basic Multilingual Plane (BMP), are explicitly ruled out.
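To make the distinction concrete, here is a small sketch (the helper `can_encode` is a hypothetical name introduced only for this illustration). A character outside the BMP is encoded by UTF-16 as a pair of surrogates, while a lone surrogate is refused by the default "strict" policy:

```python
# A character outside the BMP is stored in UTF-16 as a pair of surrogates.
emoji = "\U0001F600"
utf16_pair = emoji.encode("utf-16-be")
print(utf16_pair.hex())  # two 16-bit units: a high and a low surrogate


def can_encode(text, encoding):
    """Return True if `text` encodes under the default 'strict' policy."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# A lone surrogate is not a character, so the strict codecs reject it.
lone_surrogate = "\ud875"
for codec in ("utf-8", "utf-16", "utf-32"):
    print(codec, can_encode(lone_surrogate, codec))
```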

Fortunately (or unfortunately, since it looks like you may be doing "the wrong thing" to start with), Python has, since Python 3.1, explicitly allowed surrogate code points to be encoded as UTF-8 (and later as UTF-16 and UTF-32) by selecting a special "errors" policy, "surrogatepass", when encoding and decoding.

Keep in mind, as I wrote in the opening sentence, that the resulting byte sequence is not valid UTF-8 (or UTF-16) "as is": any code consuming this data has to be aware of how the byte sequence was created, and use the same "allow surrogates" policy when decoding:


In [75]: a = "maçã\ud875"

In [76]: b = a.encode("utf-8", errors="surrogatepass")

In [77]: b
Out[77]: b'ma\xc3\xa7\xc3\xa3\xed\xa1\xb5'

In [78]: b.decode("utf-8")
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-78-a863a95176d0> in <module>
----> 1 b.decode("utf-8")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 6: invalid continuation byte

In [79]: b.decode("utf-8", errors="surrogatepass")
Out[79]: 'maçã\ud875'

In [80]: b.decode("utf-8", errors="surrogatepass") == a
Out[80]: True
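Since the question is about writing the string to a file and getting a hex view, one possible way to apply this is to encode explicitly and write the bytes in binary mode. This is only a sketch: the temporary directory and the file name `surrogates.bin` are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

text = "maçã\ud875"

# Encode explicitly with the "surrogatepass" policy, then write raw bytes.
data = text.encode("utf-8", errors="surrogatepass")
print(data.hex())  # hex view of the bytes, as asked in the question

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "surrogates.bin"   # hypothetical file name
    path.write_bytes(data)

    # The reader must apply the same policy; a plain utf-8 decode would fail.
    restored = path.read_bytes().decode("utf-8", errors="surrogatepass")

print(restored == text)
```

Writing in binary mode keeps the non-standard encoding step visible in the code, which is a useful reminder that the file is not valid UTF-8 for other tools.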

You could also use errors="xmlcharrefreplace" or errors="backslashreplace", but reading those back is even more cumbersome. Besides, if the original text already contains literal escape sequences in the same notation, those would also be converted to characters when decoding, so the round trip is not reliable. The upside is that the resulting bytes are valid UTF-8:

In [82]: a = "maçã\ud875"

In [83]: b = a.encode("utf8", errors="backslashreplace")

In [84]: b
Out[84]: b'ma\xc3\xa7\xc3\xa3\\ud875'

In [85]: c = b.decode("utf-8")

In [86]: c == a
Out[86]: False

In [87]: c
Out[87]: 'maçã\\ud875'

In [88]: d = c.encode("latin1").decode("unicode_escape")

In [89]: d
Out[89]: 'maçã\ud875'

In [90]: d == a
Out[90]: True
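The ambiguity with literal escape sequences can be seen in a short sketch (the variable names are made up for this illustration). A string holding a real surrogate and a string holding the six literal characters `\ud875` encode to the same bytes, so decoding cannot tell them apart:

```python
with_surrogate = "ma\ud875"     # contains the lone surrogate U+D875
with_literal = "ma\\ud875"      # contains the six characters \ u d 8 7 5

enc_surrogate = with_surrogate.encode("utf-8", errors="backslashreplace")
enc_literal = with_literal.encode("utf-8", errors="backslashreplace")

# Both inputs produce identical bytes, so the decoder cannot tell them apart.
print(enc_surrogate == enc_literal)

# Undoing the escape turns the literal text into a surrogate as well.
decoded = enc_literal.decode("utf-8").encode("latin1").decode("unicode_escape")
print(decoded == with_literal)     # the literal backslash sequence was lost
print(decoded == with_surrogate)

# xmlcharrefreplace writes a decimal character reference instead.
xml_form = "\ud875".encode("utf-8", errors="xmlcharrefreplace")
print(xml_form)
```

Note also that the latin1 step used to undo the escaping only works while the rest of the text stays within the 0-255 range.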

Upvotes: 2
