Fabio Zadrozny

Reputation: 25362

How to convert chr(0xdfff) to utf-8 bytes in Python 3 as in Python 2?

The code below demonstrates my problem. It works as expected in Python 2.7, yet all of the encode calls I've tried fail in Python 3.5 (see the exception below). Does anyone know of a way to circumvent this error and make it work in Python 3.5 as it did in Python 2.7?

import sys

if sys.version_info[0] <= 2:
    chr = unichr

out = chr(0xdfff)
print(repr(out)) # u'\udfff' in Python 2, '\udfff' in Python 3
assert out.encode('utf-8').decode('utf-8') == out
assert out.encode('utf-8', errors='surrogateescape').decode('utf-8') == out
assert out.encode('utf-8', errors='strict').decode('utf-8') == out

Error in Python 3.5:

Traceback (most recent call last):
  File "W:\rocky40\Projects\etk\coilib50\source\python\coilib50\io\xmlpickle\snippet.py", line 8, in <module>
    assert out.encode('utf-8').decode('utf-8') == out
UnicodeEncodeError: 'utf-8' codec can't encode character '\udfff' in position 0: surrogates not allowed

Note that a different encoding wouldn't really suit it as I have files written this way to disk in Python 2 and I need to be able to load it back and dump it again on Python 3 so that Python 2 can read it again (so, the actual bytes written shouldn't really change).

Upvotes: 2

Views: 1872

Answers (2)

Fabio Zadrozny

Reputation: 25362

After searching a bit more I noticed that https://docs.python.org/3/library/codecs.html#codec-base-classes points to a surrogatepass error handler which is specific to the utf-X codecs. Using surrogatepass instead of surrogateescape does get the trick done and works properly on Python 3:

assert out.encode('utf-8', errors='surrogatepass'
    ).decode('utf-8', errors='surrogatepass') == out
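For the files-on-disk requirement in the question, the key point is that surrogatepass makes Python 3 emit the same byte sequence that Python 2's plain UTF-8 encoder produced for a lone surrogate (a quick sketch; the expected bytes below are the standard 3-byte UTF-8-style encoding of code point U+DFFF):

```python
out = chr(0xdfff)

# 'surrogatepass' lets the UTF-8 codec encode the lone surrogate,
# matching what u'\udfff'.encode('utf-8') produced on Python 2.
data = out.encode('utf-8', errors='surrogatepass')
assert data == b'\xed\xbf\xbf'

# The same handler is needed on decode, since a strict UTF-8 decode
# of these bytes would raise UnicodeDecodeError.
assert data.decode('utf-8', errors='surrogatepass') == out
```

So the actual bytes written do not change between the two Python versions.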

Upvotes: 2

developer_hatch

Reputation: 16214

The problem is that 0xdfff is a surrogate code point, reserved for UTF-16 surrogate pairs:

import sys

if sys.version_info[0] <= 2:
    chr = unichr

out = chr(0xdfff)

print(out.encode('utf-16-le', 'ignore').decode('utf-16-le', 'ignore') == out)

This runs without raising, but on Python 3 the 'ignore' handler silently drops the lone surrogate, so the round trip prints False; according to this answer, you will have problems with surrogates either way.
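For completeness (a sketch, not part of the original answer): the surrogatepass handler from the other answer also applies to the UTF-16 codecs, so a lossless round trip is possible there as well, whereas 'ignore' loses the character:

```python
out = chr(0xdfff)

# With 'ignore', the UTF-16 encoder drops the lone surrogate entirely.
assert out.encode('utf-16-le', 'ignore') == b''

# With 'surrogatepass', the codec emits the raw 0xDFFF code unit
# (little-endian), and the decode side restores it.
data = out.encode('utf-16-le', errors='surrogatepass')
assert data == b'\xff\xdf'
assert data.decode('utf-16-le', errors='surrogatepass') == out
```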

Upvotes: 0
