Stream in-memory data over Python Subprocess' Popen over external command

Question

What I want to achieve

I want to stream elements from a generator-like object, line by line, through an external program from Python.
In short, I want something like Generator -> Popen(...) -> Generator without holding too much data in memory.

Here is a minimal example that demonstrates what I want to achieve:


    from io import StringIO
    from subprocess import Popen, PIPE
    import time

    def stream():
        proc_input = StringIO("aa\nbb\ncc\ndd")
        proc = Popen(["cat"], stdin=PIPE, stdout=PIPE)
        for line in proc_input:
            proc.stdin.write(line.encode())
            yield proc.stdout.readline()
            time.sleep(1)

Problem: The proc.stdout.readline() just blocks and doesn't show anything.

What I already learned:

If the input comes from a file-like object (i.e. something that implements fileno()), I can pass it directly as stdin and avoid writing to the PIPE. But to do so, I would first need to write the generator's output to a file, which I would like to avoid because it seems like an unnecessary detour. For example, the following works:


    import tempfile
    from subprocess import Popen, PIPE

    tp = tempfile.TemporaryFile()
    tp.write("aa\nbb\ncc\ndd".encode())
    tp.seek(0)
    proc = Popen(["cat"], stdin=tp, stdout=PIPE)
    for line in proc.stdout:
        print(line)

If I stick to writing to the PIPE object, I can resolve the problem by closing the input stream and then reading from the output stream. But then I don't know where the data lives in the meantime. Because my generator yields gigabytes of data, I do not want to run into memory errors.


    proc_input = StringIO("aa\nbb\ncc\ndd")
    proc = Popen(["cat"], stdin=PIPE, stdout=PIPE)
    for line in proc_input:
        proc.stdin.write(line.encode())
    proc.stdin.close()

    for line in proc.stdout:
        print(line)

What I also tried:

I played around with the bufsize argument, Popen(..., bufsize=...), but it did not seem to have any effect.
I tried writing the input data to an io.BufferedWriter in the hope that Popen could accept it as stdin, also without success.

Additional info: I'm using Linux.

Remarks on Comments

It was suggested to break the input generator into chunks. This can be achieved via

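    from itertools import islice
    from subprocess import Popen, PIPE

    def PopenStreaming(process, popen_kwargs, nlines, input):
        input = iter(input)
        while True:
            # Take at most nlines rows; keep nlines small enough that one
            # chunk and its output fit into the OS pipe buffers.
            chunk = list(islice(input, nlines))
            if not chunk:
                break  # input is exhausted
            proc = Popen(process, stdin=PIPE, stdout=PIPE, **popen_kwargs)
            for row in chunk:
                proc.stdin.write(row)
            proc.stdin.close()  # send EOF so the child can finish
            for row in proc.stdout:
                yield row
            proc.wait()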

finefoot · Accepted Answer

I'm not sure if it's always possible to do what you're trying to do. The docs at https://docs.python.org/3/library/subprocess.html say

Warning: Use communicate() rather than .stdin.write, .stdout.read or .stderr.read to avoid deadlocks due to any of the other OS pipe buffers filling up and blocking the child process.

So you're supposed to use communicate, but that means waiting for the process to terminate:

Popen.communicate(input=None, timeout=None) Interact with process: Send data to stdin. Read data from stdout and stderr, until end-of-file is reached. Wait for process to terminate.

That means you would be able to use communicate only once, which is not what you want.
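
As a minimal sketch of that one-shot behaviour (using `cat` as a stand-in for the real command, as in the question): `communicate` takes the complete input and returns the complete output only after the child has exited, so both are held in memory at once.

```python
from subprocess import Popen, PIPE

# `cat` stands in for the real external command, as in the question.
proc = Popen(["cat"], stdin=PIPE, stdout=PIPE)

# All input is passed in one call; the call returns only after `cat` exits.
stdout_data, _ = proc.communicate(input=b"aa\nbb\ncc\ndd\n")

print(stdout_data)      # the whole output as a single bytes object
print(proc.returncode)  # already set, because the process has terminated
```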

However, I think using line-buffered text mode should be safe and avoid a deadlock:

    from subprocess import Popen, PIPE

    kwargs = {
        "stdin": PIPE,
        "stdout": PIPE,
        "universal_newlines": True,  # text mode
        "bufsize": 1,  # line buffered
    }

    with Popen(["cat"], **kwargs) as process:
        for data in ["A\n", "B\n", "C\n"]:
            process.stdin.write(data)
            print("data sent:", data)
            output = process.stdout.readline()
            print("output received:", output)
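
The lock-step loop above relies on the child emitting exactly one output line per input line. If that does not hold for your command, a common workaround (not part of the approach above, and only a sketch) is to write to stdin from a background thread while the main thread reads stdout. The names `stream_through` and `numbers` are hypothetical helpers made up for this illustration, and `cat` again stands in for the real command.

```python
import threading
from subprocess import Popen, PIPE


def stream_through(cmd, lines):
    """Yield output lines of `cmd` while feeding it `lines`.

    Hypothetical helper, not part of subprocess. `lines` must yield
    newline-terminated str objects.
    """
    with Popen(cmd, stdin=PIPE, stdout=PIPE,
               universal_newlines=True, bufsize=1) as proc:

        def feed():
            # Runs in a background thread. A write blocks while the stdin
            # pipe is full, which throttles the input generator instead of
            # piling its data up in memory.
            try:
                for line in lines:
                    proc.stdin.write(line)
            except BrokenPipeError:
                pass  # the child exited before consuming all input
            finally:
                try:
                    proc.stdin.close()  # send EOF so the child can finish
                except BrokenPipeError:
                    pass

        writer = threading.Thread(target=feed, daemon=True)
        writer.start()
        for out_line in proc.stdout:  # the main thread drains stdout lazily
            yield out_line
        writer.join()


def numbers(n):
    # Stand-in for a generator that could yield gigabytes of lines.
    for i in range(n):
        yield f"{i}\n"


for out in stream_through(["cat"], numbers(5)):
    print(out, end="")
```

Because reading and writing happen concurrently, neither pipe can fill up and stall the other side, and only the data currently sitting in the OS pipe buffers is in flight. This sketch assumes the consumer iterates the generator to the end; abandoning it early would need extra cleanup.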

If that isn't applicable in your case, maybe you can split your call into multiple smaller calls? Using check_output with its input keyword argument might also simplify your code:

    from subprocess import check_output
    output = check_output(["cat"], input=b"something\n")
    print(output)
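
To illustrate the "multiple smaller calls" idea, here is a rough sketch that cuts the input generator into fixed-size chunks and runs the command once per chunk. `run_in_chunks` and `chunk_size` are names invented for this example, and the approach only makes sense if the command treats each line independently (it would not work for something like `sort`).

```python
from itertools import islice
from subprocess import check_output


def run_in_chunks(cmd, lines, chunk_size):
    """Yield output lines of `cmd`, run once per chunk of input lines.

    Hypothetical helper. `lines` must yield newline-terminated bytes.
    """
    lines = iter(lines)
    while True:
        # At most chunk_size input lines are held in memory at a time.
        chunk = list(islice(lines, chunk_size))
        if not chunk:
            break  # input is exhausted
        output = check_output(cmd, input=b"".join(chunk))
        yield from output.splitlines(keepends=True)


data = (f"{i}\n".encode() for i in range(10))
for out in run_in_chunks(["cat"], data, chunk_size=4):
    print(out)
```

The trade-off is one process start per chunk, so a larger `chunk_size` lowers the overhead at the cost of more memory per call.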
