Reputation: 105083
This is my code in Python:
[...]
proc = Popen(path, stdin=stdin, stdout=PIPE, stderr=PIPE)
result = [x for x in proc.stdout.readlines()]
result = ''.join(result);
Everything works fine when the output is ASCII. But when I receive UTF-8 text on stdout, the result is unpredictable; in most cases the output is damaged. What is wrong here?
Btw, maybe this code could be optimized somehow?
Upvotes: 6
Views: 11319
Reputation: 39
Set the environment variable PYTHONIOENCODING, and set the encoding in Popen:
#tst1.py
import subprocess
import sys, os
#print(sys.stdout.encoding) #output: utf-8 this default for interactive console
os.environ['PYTHONIOENCODING'] = 'utf-8'
p = subprocess.Popen(['python', 'tst2.py'], encoding='utf-8', stdout=subprocess.PIPE, stderr=subprocess.PIPE)
#print(p.stdout) #output: <_io.TextIOWrapper name=3 encoding='utf-8'>
#print(p.stdout.encoding, ' ', p.stderr.encoding) #ouput: utf-8 utf-8
outs, errors = p.communicate()
print(outs, errors)
where tst1.py runs another Python script, tst2.py, like:
#tst2.py
import sys
print(sys.stdout.encoding) #output: utf-8
print('\u2e85') #a chinese char
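The two files above can be condensed into a single runnable sketch by inlining the child script with -c (sys.executable stands in for 'python'; the rest follows tst1.py and tst2.py above):

```python
import os
import subprocess
import sys

# The child inherits this, so its piped stdout is encoded as UTF-8
os.environ['PYTHONIOENCODING'] = 'utf-8'

# Inline equivalent of tst2.py
child = "import sys; print(sys.stdout.encoding); print('\\u2e85')"

p = subprocess.Popen([sys.executable, '-c', child],
                     encoding='utf-8',   # decode the pipe as UTF-8 on our side
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
outs, errors = p.communicate()
print(outs)
```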
Using PIPE indicates that a pipe to the standard stream should be opened. A pipe is a unidirectional data channel that can be used for interprocess communication. Pipes deal with binary data and are agnostic to the encoding; if the data is text, the applications on each side of the pipe must agree on its encoding (read more).
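A quick way to see this: without an encoding argument, Popen hands you the raw bytes from the pipe; with one, Python wraps the pipe in a text layer and decodes for you (a minimal sketch):

```python
import subprocess
import sys

cmd = [sys.executable, '-c', "print('hi')"]

# No encoding given: the pipe yields bytes
raw = subprocess.Popen(cmd, stdout=subprocess.PIPE).communicate()[0]

# encoding given: Popen decodes the pipe and yields str
txt = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                       encoding='utf-8').communicate()[0]

print(type(raw).__name__, type(txt).__name__)  # bytes str
```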
So firstly, the stdout of tst2.py should use utf-8 encoding; otherwise it raises an error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2e85' in position 0: character maps to <undefined>
The streams sys.stdout and sys.stderr are regular text files like those returned by the open() function. On Windows, non-character devices such as pipes and disk files use the system locale encoding (i.e. an ANSI code page such as CP1252). On all platforms, you can override the character encoding by setting the PYTHONIOENCODING environment variable before running the interpreter.
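For instance, you can force a child interpreter onto a specific stream encoding through its environment (cp1252 here is an arbitrary choice for illustration):

```python
import os
import subprocess
import sys

# Override the stream encoding for the child process only
env = dict(os.environ, PYTHONIOENCODING='cp1252')

out = subprocess.check_output(
    [sys.executable, '-c', 'import sys; print(sys.stdout.encoding)'],
    env=env)
print(out.decode().strip())  # cp1252
```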
Secondly, tst1.py must know how to decode what it reads from the pipe, hence the encoding='utf-8' argument to Popen.
With Python 3.6+, following PEP 528, the default encoding of the interactive console on Windows is utf-8 (it can be changed by setting both PYTHONIOENCODING and PYTHONLEGACYWINDOWSSTDIO). But this does not apply to pipes and redirection.
Upvotes: 3
Reputation: 995
I ran into the same issue when using LogPipe.
I solved it by passing the additional arguments encoding='utf-8', errors='ignore' to fdopen():
# https://codereview.stackexchange.com/questions/6567/redirecting-subprocesses-output-stdout-and-stderr-to-the-logging-module
import logging
import os
import threading

vlogger = logging.getLogger(__name__)  # logger used by the reader thread

class LogPipe(threading.Thread):
    def __init__(self):
        """Set up the object with a logger and start the thread."""
        threading.Thread.__init__(self)
        self.daemon = False
        self.fdRead, self.fdWrite = os.pipe()
        # Set utf-8 encoding and just ignore illegal characters
        self.pipeReader = os.fdopen(self.fdRead, encoding='utf-8', errors='ignore')
        self.start()

    def fileno(self):
        """Return the write file descriptor of the pipe."""
        return self.fdWrite

    def run(self):
        """Run the thread, logging everything read from the pipe."""
        for line in iter(self.pipeReader.readline, ''):
            vlogger.debug(line.strip('\n'))
        self.pipeReader.close()

    def close(self):
        """Close the write end of the pipe."""
        os.close(self.fdWrite)
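The effect of errors='ignore' on fdopen can be seen with a bare pipe, no thread needed (a minimal sketch; the invalid trailing bytes are made up for the demonstration):

```python
import os

r, w = os.pipe()
reader = os.fdopen(r, encoding='utf-8', errors='ignore')

# Valid UTF-8 followed by bytes that are not legal UTF-8
os.write(w, 'héllo\n'.encode('utf-8') + b'\xff\xfe')
os.close(w)

text = reader.read()  # the illegal bytes are silently dropped
reader.close()
print(repr(text))
```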
Upvotes: 1
Reputation: 94495
Have you tried decoding each line and only then combining the resulting strings? In Python 2.4+ (at least), this can be achieved with
result = [x.decode('utf8') for x in proc.stdout.readlines()]
The important point is that your lines x are sequences of bytes that must be interpreted as representing characters. The decode() method performs this interpretation (here, the bytes are assumed to be in the UTF-8 encoding): x.decode('utf8') is of type unicode, which you can think of as a "string of characters" (which is different from a "string of numbers between 0 and 255", i.e. bytes).
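In modern Python 3 the same idea looks like this (a sketch; the child command is invented for illustration, and bytes/str replace Python 2's str/unicode):

```python
import os
import sys
from subprocess import PIPE, Popen

env = dict(os.environ, PYTHONIOENCODING='utf-8')  # make the child emit UTF-8
proc = Popen([sys.executable, '-c', "print('caf\\u00e9')"],
             stdout=PIPE, env=env)

raw = proc.stdout.read()   # bytes, e.g. b'caf\xc3\xa9\n'
text = raw.decode('utf8')  # str (the Python 3 name for unicode)
proc.wait()
print(text)
```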
Upvotes: 6