Reputation: 105083
This is my code in Python:
[...]
proc = Popen(path, stdin=stdin, stdout=PIPE, stderr=PIPE)
result = [x for x in proc.stdout.readlines()]
result = ''.join(result);
Everything works fine when the output is ASCII. But when I receive UTF-8 text on stdout, the result is unpredictable; in most cases the output is damaged. What is wrong here?
Btw, maybe this code could be optimized somehow?
Upvotes: 6
Views: 11319
Reputation: 39
Set the environment variable PYTHONIOENCODING, and set the encoding in Popen:
#tst1.py
import subprocess
import sys, os
#print(sys.stdout.encoding) #output: utf-8 this default for interactive console
os.environ['PYTHONIOENCODING'] = 'utf-8'
p = subprocess.Popen(['python', 'tst2.py'], encoding='utf-8', stdout=subprocess.PIPE, stderr=subprocess.PIPE)
#print(p.stdout) #output: <_io.TextIOWrapper name=3 encoding='utf-8'>
#print(p.stdout.encoding, ' ', p.stderr.encoding) #ouput: utf-8 utf-8
outs, errors = p.communicate()
print(outs, errors)
where tst1.py runs another Python script, tst2.py, like:
#tst2.py
import sys
print(sys.stdout.encoding) #output: utf-8
print('\u2e85') #a chinese char
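The two files above can be condensed into a single runnable sketch by inlining the child script with -c (sys.executable stands in for 'python'; the rest follows tst1.py and tst2.py above):

```python
import os
import subprocess
import sys

# The child inherits this, so its piped stdout is encoded as UTF-8
os.environ['PYTHONIOENCODING'] = 'utf-8'

# Inline equivalent of tst2.py
child = "import sys; print(sys.stdout.encoding); print('\\u2e85')"

p = subprocess.Popen([sys.executable, '-c', child],
                     encoding='utf-8',   # decode the pipe as UTF-8 on our side
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
outs, errors = p.communicate()
print(outs)
```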
Using PIPE indicates that a pipe to the standard stream should be opened. A pipe is a unidirectional data channel that can be used for interprocess communication. Pipes deal with binary data and are agnostic to the encoding; if the data is text, the applications on each side of the pipe must agree on its encoding (read more).
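A quick way to see this: without an encoding argument, Popen hands you the raw bytes from the pipe; with one, Python wraps the pipe in a text layer and decodes for you (a minimal sketch):

```python
import subprocess
import sys

cmd = [sys.executable, '-c', "print('hi')"]

# No encoding given: the pipe yields bytes
raw = subprocess.Popen(cmd, stdout=subprocess.PIPE).communicate()[0]

# encoding given: Popen decodes the pipe and yields str
txt = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                       encoding='utf-8').communicate()[0]

print(type(raw).__name__, type(txt).__name__)  # bytes str
```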
So firstly, the stdout of tst2.py should use utf-8 encoding; otherwise it raises an error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2e85' in position 0: character maps to <undefined>
The streams sys.stdout and sys.stderr are regular text files like those returned by the open() function. On Windows, non-character devices such as pipes and disk files use the system locale encoding (i.e. an ANSI code page such as CP1252). On all platforms, you can override the character encoding by setting the PYTHONIOENCODING environment variable before running the interpreter.
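For instance, you can force a child interpreter onto a specific stream encoding through its environment (cp1252 here is an arbitrary choice for illustration):

```python
import os
import subprocess
import sys

# Override the stream encoding for the child process only
env = dict(os.environ, PYTHONIOENCODING='cp1252')

out = subprocess.check_output(
    [sys.executable, '-c', 'import sys; print(sys.stdout.encoding)'],
    env=env)
print(out.decode().strip())  # cp1252
```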
Secondly, tst1.py must know how to decode what it reads from the pipe, hence the encoding='utf-8' argument to Popen.
With Python 3.6+, following PEP 528, the default encoding of the interactive console on Windows is utf-8 (it can be changed by setting both PYTHONIOENCODING and PYTHONLEGACYWINDOWSSTDIO). But this does not apply to pipes and redirection.
Upvotes: 3
Reputation: 995
I ran into the same issue when using LogPipe.
I solved it by passing the additional arguments encoding='utf-8', errors='ignore' to fdopen():
# https://codereview.stackexchange.com/questions/6567/redirecting-subprocesses-output-stdout-and-stderr-to-the-logging-module
import logging
import os
import threading

vlogger = logging.getLogger(__name__)  # logger used by the reader thread

class LogPipe(threading.Thread):
    def __init__(self):
        """Set up the object with a logger and start the thread."""
        threading.Thread.__init__(self)
        self.daemon = False
        self.fdRead, self.fdWrite = os.pipe()
        # Set utf-8 encoding and just ignore illegal characters
        self.pipeReader = os.fdopen(self.fdRead, encoding='utf-8', errors='ignore')
        self.start()

    def fileno(self):
        """Return the write file descriptor of the pipe."""
        return self.fdWrite

    def run(self):
        """Run the thread, logging everything read from the pipe."""
        for line in iter(self.pipeReader.readline, ''):
            vlogger.debug(line.strip('\n'))
        self.pipeReader.close()

    def close(self):
        """Close the write end of the pipe."""
        os.close(self.fdWrite)
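The effect of errors='ignore' on fdopen can be seen with a bare pipe, no thread needed (a minimal sketch; the invalid trailing bytes are made up for the demonstration):

```python
import os

r, w = os.pipe()
reader = os.fdopen(r, encoding='utf-8', errors='ignore')

# Valid UTF-8 followed by bytes that are not legal UTF-8
os.write(w, 'héllo\n'.encode('utf-8') + b'\xff\xfe')
os.close(w)

text = reader.read()  # the illegal bytes are silently dropped
reader.close()
print(repr(text))
```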
Upvotes: 1
Reputation: 94495
Have you tried decoding each line and only then combining the resulting strings? In Python 2.4+ (at least), this can be achieved with
result = [x.decode('utf8') for x in proc.stdout.readlines()]
The important point is that your lines x are sequences of bytes that must be interpreted as representing characters. The decode() method performs this interpretation (here, the bytes are assumed to be in the UTF-8 encoding): x.decode('utf8') is of type unicode, which you can think of as a "string of characters" (which is different from a "string of numbers between 0 and 255", i.e. bytes).
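In modern Python 3 the same idea looks like this (a sketch; the child command is invented for illustration, and bytes/str replace Python 2's str/unicode):

```python
import os
import sys
from subprocess import PIPE, Popen

env = dict(os.environ, PYTHONIOENCODING='utf-8')  # make the child emit UTF-8
proc = Popen([sys.executable, '-c', "print('caf\\u00e9')"],
             stdout=PIPE, env=env)

raw = proc.stdout.read()   # bytes, e.g. b'caf\xc3\xa9\n'
text = raw.decode('utf8')  # str (the Python 3 name for unicode)
proc.wait()
print(text)
```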
Upvotes: 6