Fabio Zadrozny

Reputation: 25362

How to convert chr(0xdfff) to utf-8 bytes in Python 3 as in Python 2?

The code below demonstrates my problem. It works as expected in Python 2.7, yet all of the encode calls I've tried fail in Python 3.5 (see the exception below). Does anyone know of a way to circumvent this error and make it work in Python 3.5 as it did in Python 2.7?

import sys

if sys.version_info[0] <= 2:
    chr = unichr

out = chr(0xdfff)
print(repr(out)) # u'\udfff' in Python 2, '\udfff' in Python 3
assert out.encode('utf-8').decode('utf-8') == out
assert out.encode('utf-8', errors='surrogateescape').decode('utf-8') == out
assert out.encode('utf-8', errors='strict').decode('utf-8') == out

Error in Python 3.5:

Traceback (most recent call last):
  File "W:\rocky40\Projects\etk\coilib50\source\python\coilib50\io\xmlpickle\snippet.py", line 8, in <module>
    assert out.encode('utf-8').decode('utf-8') == out
UnicodeEncodeError: 'utf-8' codec can't encode character '\udfff' in position 0: surrogates not allowed

Note that a different encoding wouldn't really suit it as I have files written this way to disk in Python 2 and I need to be able to load it back and dump it again on Python 3 so that Python 2 can read it again (so, the actual bytes written shouldn't really change).

Upvotes: 2

Views: 1872

Answers (2)

Fabio Zadrozny

Reputation: 25362

After searching a bit more I noticed that https://docs.python.org/3/library/codecs.html#codec-base-classes points to a surrogatepass error handler which is specific to the utf-X codecs. Using surrogatepass instead of surrogateescape does get the trick done and works properly on Python 3:

assert out.encode('utf-8', errors='surrogatepass'
    ).decode('utf-8', errors='surrogatepass') == out
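For the files-on-disk requirement in the question, the key point is that surrogatepass makes Python 3 emit the same byte sequence that Python 2's plain UTF-8 encoder produced for a lone surrogate (a quick sketch; the expected bytes below are the standard 3-byte UTF-8-style encoding of code point U+DFFF):

```python
out = chr(0xdfff)

# 'surrogatepass' lets the UTF-8 codec encode the lone surrogate,
# matching what u'\udfff'.encode('utf-8') produced on Python 2.
data = out.encode('utf-8', errors='surrogatepass')
assert data == b'\xed\xbf\xbf'

# The same handler is needed on decode, since a strict UTF-8 decode
# of these bytes would raise UnicodeDecodeError.
assert data.decode('utf-8', errors='surrogatepass') == out
```

So the actual bytes written do not change between the two Python versions.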

Upvotes: 2

developer_hatch

Reputation: 16214

The problem is that 0xdfff is a surrogate code point, reserved for UTF-16 surrogate pairs:

import sys

if sys.version_info[0] <= 2:
    chr = unichr

out = chr(0xdfff)

print(out.encode('utf-16-le', 'ignore').decode('utf-16-le', 'ignore') == out)

This runs without raising, but on Python 3 the 'ignore' handler silently drops the lone surrogate, so the round trip prints False; according to this answer, you will have problems with surrogates either way.
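For completeness (a sketch, not part of the original answer): the surrogatepass handler from the other answer also applies to the UTF-16 codecs, so a lossless round trip is possible there as well, whereas 'ignore' loses the character:

```python
out = chr(0xdfff)

# With 'ignore', the UTF-16 encoder drops the lone surrogate entirely.
assert out.encode('utf-16-le', 'ignore') == b''

# With 'surrogatepass', the codec emits the raw 0xDFFF code unit
# (little-endian), and the decode side restores it.
data = out.encode('utf-16-le', errors='surrogatepass')
assert data == b'\xff\xdf'
assert data.decode('utf-16-le', errors='surrogatepass') == out
```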

Upvotes: 0
