Haunted

Reputation: 43

Converting between charsets in Python

I need to print some strings to stdout. Because the Windows console uses cp437, an exception is raised whenever a string contains characters outside cp437.

I got around this by

import sys

encoding = sys.stdout.encoding
# Round-trip through bytes so that unencodable characters become "?"
pathstr = path.encode(encoding, errors="replace").decode(encoding)
print(pathstr)

where path is the str I want to output. I'm fine with characters replaced by "?"

This doesn't seem good because it converts to a byte array and back to a str.

Is there a better way to achieve this?

I'm still new to Python (about a week) and I'm using Windows 7 32-bit with CPython 3.3.

Upvotes: 2

Views: 583

Answers (3)

jfs

Reputation: 414795

I'm fine with characters replaced by "?"

You could set the PYTHONIOENCODING environment variable:

C:\> set PYTHONIOENCODING=cp437:replace

And print Unicode strings directly:

print(path)

In that case, if you redirect the output to a file, you could set PYTHONIOENCODING to utf-8 instead and get the complete, correct output.
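To see the effect of an encoding plus the "replace" error handler without touching the real console, here is a minimal sketch that wraps an in-memory byte buffer in an io.TextIOWrapper configured that way. The helper name render_for_console is made up for illustration; it only approximates what a stdout configured through PYTHONIOENCODING would emit.

```python
import io


def render_for_console(text, encoding="cp437", errors="replace"):
    """Return the bytes a text stream with this encoding/error handler emits.

    Hypothetical helper for illustration only; it imitates a stdout set up
    with PYTHONIOENCODING=cp437:replace by printing to an in-memory buffer.
    """
    raw = io.BytesIO()
    # newline="\n" turns off line-ending translation, so the result is the
    # same on every platform.
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors, newline="\n")
    print(text, file=stream)
    stream.flush()
    return raw.getvalue()


if __name__ == "__main__":
    # Cyrillic letters are not part of cp437, so each one is replaced.
    print(render_for_console("cyrillic: цык."))
    # With UTF-8 as the target encoding nothing has to be replaced.
    print(render_for_console("cyrillic: цык.", encoding="utf-8"))
```

With cp437 every character that cannot be encoded comes out as "?", while the utf-8 variant keeps the text intact, which is why utf-8 is the better choice when the output goes to a file.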

You could also try the WriteConsoleW()-based solutions from the corresponding Python bug and see whether they work on Python 3.3, e.g.:

import _win_console
_win_console.install_unicode_console()

print("cyrillic: цык.")

Where _win_console is from win_console.patch. You don't need to set the environment variable in this case and it should work with any codepage (with an appropriate console font, it might even show characters outside the current codepage).

All solutions for the problem of printing Unicode inside the Windows console have drawbacks (see the discussion and the reference links in the bug tracker for all the gory details).

Upvotes: 1

bobince

Reputation: 536695

This doesn't seem good because it converts to a byte array and back to a str.

If you want to write raw bytes to the stream, use .buffer:

import sys
encoding = sys.stdout.encoding  # same encoding as in the question
pathbytes = path.encode(encoding, errors='replace')
sys.stdout.buffer.write(pathbytes)
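A slightly fuller sketch of the same idea, under a few assumptions: the helper name write_replaced is hypothetical, it falls back to UTF-8 when the stream reports no encoding, and it flushes the text layer first so that earlier print() output is not reordered relative to the raw bytes. Note that writing to .buffer does not append a newline the way print() does.

```python
import io
import sys


def write_replaced(text, stream=None):
    """Encode `text` for `stream` and write the bytes to its underlying buffer.

    Hypothetical helper: characters that the stream's encoding cannot
    represent are replaced with "?". Returns the number of bytes written.
    """
    if stream is None:
        stream = sys.stdout
    # Assumption: fall back to UTF-8 if the stream reports no encoding.
    encoding = getattr(stream, "encoding", None) or "utf-8"
    data = text.encode(encoding, errors="replace")
    # Flush pending text first so it stays ahead of the raw bytes.
    stream.flush()
    stream.buffer.write(data)
    stream.buffer.flush()
    return len(data)


if __name__ == "__main__":
    # Demonstrate against an in-memory cp437 stream instead of the console.
    raw = io.BytesIO()
    fake_stdout = io.TextIOWrapper(raw, encoding="cp437", newline="\n")
    write_replaced("path: C:\\файл.txt\n", fake_stdout)
    print(raw.getvalue())
```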

...oh for the day that issue 1602 comes to something and we can avoid the Unicode horror of the Windows command prompt...

Upvotes: 3

mlissner

Reputation: 18206

The best advice I ever heard about Unicode was to make a Unicode Sandwich:

  1. Immediately convert any incoming text in your program into Unicode.
  2. Deal exclusively with Unicode in your program.
  3. Export to whatever serialization format you want for your output.

In this case, you're basically doing just that. In a longer program, it would make sense to do this in the manner you describe, and I think you'd feel more comfortable about it.
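The three sandwich steps listed above could be sketched roughly as follows. The function names and the choice of UTF-8 for input and cp437 for output are assumptions made for the example, and process() is only a placeholder for real program logic.

```python
def read_text(raw_bytes, encoding="utf-8"):
    """Step 1: decode incoming bytes into str at the program boundary."""
    return raw_bytes.decode(encoding)


def process(text):
    """Step 2: work only with str inside the program (placeholder logic)."""
    return text.strip()


def write_text(text, encoding="cp437"):
    """Step 3: encode once, on output, replacing unencodable characters."""
    return text.encode(encoding, errors="replace")


if __name__ == "__main__":
    incoming = "  résumé_цык.txt  ".encode("utf-8")
    outgoing = write_text(process(read_text(incoming)))
    print(outgoing)
```

The point of the layout is that encoding and decoding happen only at the edges, so everything in between can treat text as plain str.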

The only change I'd make would be to keep the text as Unicode all the way through and encode to cp437 only on output.

Upvotes: 0
