Sir l33tname

Reputation: 4330

Configure encoding of a python stream

Is there a way (on python2 and python3) to configure tmp_stdout to use a different encoding?

(I know that on python3 there is the encoding parameter but this is not possible on python2)

import tempfile
import sys

original_stdout = sys.stdout

with tempfile.TemporaryFile(mode="w+") as tmp_stdout:
    # patch sys.stdout  
    sys.stdout = tmp_stdout

    print("📙")
    tmp_stdout.seek(0)
    actual_output = tmp_stdout.read()

# restore stdout
sys.stdout = original_stdout

Also, why is the default encoding on Windows cp1252 even though my Command Prompt uses cp850?

This is the error you get when you run it on Windows with Python 3.6:

Traceback (most recent call last):
  File "Desktop\test.py", line 11, in <module>
    print("📙")
  File "C:\Users\AppData\Local\Programs\Python\Python36\lib\tempfile.py", line 483, in func_wrapper
    return func(*args, **kwargs)
  File "C:\Users\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f4d9' in position 0: character maps to <undefined>

Upvotes: 0

Views: 1091

Answers (1)

Eryk Sun

Reputation: 34270

The Windows console defaults to the system OEM codepage (e.g. 850 in Western Europe), which supports legacy DOS programs and batch scripts, but serves no real purpose nowadays. Python 3.6+ uses the console's Unicode API instead. Internally this is UTF-16LE, but at the buffer/raw layer it presents as UTF-8 for cross-platform compatibility. To get similar support in Python 2, install and enable win_unicode_console.
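As a quick diagnostic for the cp850 versus cp1252 mismatch raised in the question, the following sketch prints the encoding Python reports for standard output next to the locale's preferred encoding, which Python 3 normally uses when a file is opened in text mode without an explicit encoding. The variable names are illustrative only, and the values depend on the machine, the Python version, and whether output is redirected or UTF-8 mode is enabled.

```python
import locale
import sys

# Encoding Python reports for standard output. This may be None or
# missing if sys.stdout has been replaced by a custom object.
stdout_encoding = getattr(sys.stdout, "encoding", None)

# Locale's preferred encoding, which Python 3 normally falls back to
# for text-mode files opened without encoding=...
file_default_encoding = locale.getpreferredencoding(False)

print("stdout encoding:", stdout_encoding)
print("default file encoding:", file_default_encoding)
```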

For non-console files, the default encoding in Python 3 is the system ANSI codepage (e.g. 1252 in Western Europe). This is the classic default for many text editors in Windows, such as Notepad. To get the full range of Unicode, override the encoding by passing the argument encoding='utf-8'. To support this in both Python 2 and 3, you can wrap the file descriptor (i.e. fileno()) using the io module, which was backported to Python 2 (2.6+) when Python 3 was released. For example:

import io
import sys
import tempfile

with tempfile.TemporaryFile(mode='w+b') as tmp:
    tmp_stdout = io.open(tmp.fileno(), mode='w+', encoding='utf-8', closefd=False)

    sys.stdout, original_stdout = tmp_stdout, sys.stdout
    try:
        print("📙")
    finally:
        sys.stdout = original_stdout

    tmp_stdout.seek(0)
    actual_output = tmp_stdout.read()

Note that the temp file is opened with the mode "w+b". In Python 2 on Windows this avoids the C runtime's low-level text mode, which we don't want because it handles the character 0x1A (i.e. Ctrl+Z) as the end-of-file marker (a legacy from DOS and CP/M) and does newline translation (e.g. LF -> CRLF). The io module's TextIOWrapper already implements newline translation. Note also that the io.open call uses closefd=False because the with statement already closes tmp automatically.
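If the redirection is needed in more than one place, the same pattern can be packaged as a context manager. This is only a sketch along the lines of the example above: the name utf8_stdout_capture and the choice to hand the text back through a list are assumptions made for illustration, and it has been written with Python 3 in mind (on Python 2 the printed text would additionally need to be unicode).

```python
import io
import sys
import tempfile
from contextlib import contextmanager


@contextmanager
def utf8_stdout_capture():
    """Redirect sys.stdout to a UTF-8 temp file (hypothetical helper).

    Yields a list; the captured text is appended to it on exit.
    """
    captured = []
    # Binary mode avoids the C runtime's text mode on Windows (Python 2).
    with tempfile.TemporaryFile(mode="w+b") as tmp:
        # closefd=False: the with statement above owns the descriptor.
        wrapper = io.open(tmp.fileno(), mode="w+", encoding="utf-8",
                          closefd=False)
        original_stdout = sys.stdout
        sys.stdout = wrapper
        try:
            yield captured
        finally:
            # Restore stdout first, then read back what was written.
            sys.stdout = original_stdout
            wrapper.seek(0)
            captured.append(wrapper.read())
            wrapper.close()


with utf8_stdout_capture() as captured:
    print("📙")

actual_output = captured[0]
```

Restoring sys.stdout in the finally clause means an exception raised inside the block cannot leave stdout pointing at the temp file.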

Upvotes: 3
