Reputation: 2913
How can you pass large pickled objects through a subprocess and unpickle them? My example below works for a small object (a dictionary), but stops working once the object holds more data.
Here's my working sample:
return_pickle.py
import pickle
import io
import sys
NUMS = 10
sample_obj = {'a':1, 'b': [x for x in range(NUMS)]}
d = pickle.dumps(sample_obj)
sys.stdout = io.TextIOWrapper(sys.stdout.detach(), encoding='latin-1')
print(d.decode('latin-1'), end='', flush=True)
unpickle.py
import subprocess
import pickle
proc = subprocess.Popen(["python", "return_pickle.py"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
output, err = proc.communicate()
data = pickle.loads(output)
print(data)
The above works fine as is, but if I change NUMS to 100 it errors out with _pickle.UnpicklingError: invalid load key, '\x0a'. I get the same error if I change sample_obj to hold a large list of dictionaries. How do I get around this?
I am using Python 3.7 on a Windows 10 machine.
Upvotes: 1
Views: 802
Reputation: 70602
It will work if you add protocol=0 to your dumps() call, but this is horribly convoluted. Protocol 0 is "text mode", and is inefficient in many ways that higher pickle protocols improved on, but on Windows it can make a huge difference.
The size of the object doesn't really matter. Your example would fail if you just set NUMS to 11. What happens: if an element in the list happens to be 10, pickle produces an "opcode" whose argument is a byte with value 10. But chr(10) == '\n', and in text mode output on Windows the implementation says "oh, a newline! I have to change that to carriage-return + newline instead".
So what was a single 10 byte in the original pickle stream gets corrupted to a 13 (\r) byte followed by a 10 (\n) byte. 13 ends up getting put into the list the unpickler is building, and then the leftover 10 makes no sense at all in context. That's the source of the "invalid load key, '\x0a'" message: 0x0a == 10.
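The corruption described above can be reproduced on any platform by imitating the newline translation by hand. This is only a sketch: the helper names are made up for illustration, protocol=3 is pinned on the assumption that it matches the Python 3.7 default the question uses, and the bytes.replace() call merely stands in for what Windows text mode does to the stream.

```python
import pickle


def make_pickle(nums):
    # Same shape as the object in the question; protocol 3 is an assumption
    # (it was the default on Python 3.7).
    obj = {'a': 1, 'b': list(range(nums))}
    return pickle.dumps(obj, protocol=3)


def simulate_windows_text_mode(raw):
    # Stand-in for text-mode output on Windows: every byte 10 becomes 13, 10.
    return raw.replace(b"\n", b"\r\n")


def survives_text_mode(nums):
    damaged = simulate_windows_text_mode(make_pickle(nums))
    try:
        pickle.loads(damaged)
    except pickle.UnpicklingError:
        return False
    return True


print(survives_text_mode(10))  # list holds 0..9, no byte 10 in the stream
print(survives_text_mode(11))  # list holds 0..10, the 10 gets mangled
```

With range(10) the stream happens to contain no byte 10, which is why the original sample appeared to work.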
Of course there are many other ways a byte with value 10 could end up in the pickle stream, but if you write in text mode they'll all get corrupted similarly on Windows.
There are straightforward, platform-independent ways to do this with binary pickles, all easier than trying to fool stdout into being something it wasn't intended to be. Easiest: pickle.dump(obj, f) to a file opened in binary write mode on one end, then simply pickle.load(f) on the other end, with the same file opened for binary reading.
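A minimal sketch of the file-based approach, with both ends shown in one process for brevity. The function name and the use of tempfile are illustrative assumptions; in the real setup the child would do the dump and the parent the load, agreeing on the path some other way (for example, a command-line argument).

```python
import os
import pickle
import tempfile


def roundtrip_via_file(obj):
    # Reserve a temporary path; close the descriptor so the file can be
    # reopened normally (this matters on Windows).
    fd, path = tempfile.mkstemp(suffix=".pkl")
    os.close(fd)
    try:
        # "Writer" end: binary mode, so no newline translation happens.
        with open(path, "wb") as f:
            pickle.dump(obj, f)
        # "Reader" end: binary mode again.
        with open(path, "rb") as f:
            return pickle.load(f)
    finally:
        os.remove(path)


sample_obj = {'a': 1, 'b': list(range(100))}
print(roundtrip_via_file(sample_obj) == sample_obj)
```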
Inspired by @flakes, here's a different way to trick stdout into using binary mode, but relying only on documented portable APIs:
import os, sys, pickle
...
with os.fdopen(sys.stdout.fileno(), "wb", closefd=False) as stdout:
    pickle.dump(sample_obj, stdout)
To show a possible complication, here's much the same thing using os.pipe(). This gets annoying, because the OS pipe ends are "file descriptors" on Unix-y systems, but really "handles" on Windows. So the code you need depends on the platform you're using. I'll just cater to Windows here.
writepik.py, invoked by readpik.py:
import os, pickle, msvcrt, sys
data = {"d": 1, "L": list(range(50000))}
h = int(sys.argv[1])
d = msvcrt.open_osfhandle(h, 0)
with os.fdopen(d, "wb") as dest:
    pickle.dump(data, dest)
So it's passed an integer "handle" on the command line, which it has to change into a "file descriptor", which is then passed to fdopen() to create a file object long enough to dump the pickle.
readpik.py:
import os, pickle, msvcrt, subprocess
r, w = os.pipe()
h = msvcrt.get_osfhandle(w)
os.set_handle_inheritable(h, True)
proc = subprocess.Popen(["py", "writepik.py", str(h)], close_fds=False)
os.close(w)
with os.fdopen(r, "rb") as src:
    data = pickle.load(src)
print(data)
So that's somewhat the reverse. os.pipe() returns "file descriptors", but for a subprocess to properly inherit an open Windows handle we have to make the handle inheritable instead of the file descriptor. So we get the numeric "handle" via get_osfhandle(w) long enough to mark it inheritable and plug its value into the command line for writepik.py.
It's not really hard, but the dance is delicate and so easy to get wrong.
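For comparison, here is a rough sketch of how the same pipe hand-off might look on Unix-like systems, where the descriptor number can be passed as-is. It is not part of the Windows code above: the inline child script and the function name are assumptions for illustration, and the pass_fds argument of subprocess.Popen is POSIX-only.

```python
import os
import pickle
import subprocess
import sys

# Hypothetical child, passed inline via -c so the sketch is self-contained.
# It receives the write end's descriptor number as its first argument.
CHILD_SOURCE = r'''
import os, pickle, sys
fd = int(sys.argv[1])
with os.fdopen(fd, "wb") as dest:
    pickle.dump({"d": 1, "L": list(range(50000))}, dest)
'''


def read_pickle_over_pipe():
    r, w = os.pipe()
    # pass_fds keeps descriptor w open (same number) in the child; POSIX only.
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE, str(w)],
        pass_fds=(w,),
    )
    # The parent must close its copy of the write end, or the read end
    # would never see end-of-file.
    os.close(w)
    with os.fdopen(r, "rb") as src:
        data = pickle.load(src)
    proc.wait()
    return data


if os.name == "posix":
    data = read_pickle_over_pipe()
    print(data["d"], len(data["L"]))
```

No handle conversion is needed here, which is the platform difference the Windows version has to work around.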
Upvotes: 2
Reputation: 23634
Works for me on a Windows machine if you don't stringify the result and instead write it directly to the stdout buffer:
return_pickle.py
import pickle, sys
sample_obj = {'a':1, 'b': [x for x in range(100)]}
sys.stdout.buffer.write(pickle.dumps(sample_obj))
unpickle.py
import subprocess, pickle
proc = subprocess.Popen(
["python", "return_pickle.py"],
stdout=subprocess.PIPE,
stderr=subprocess.DEVNULL,
)
output, _ = proc.communicate()
print(pickle.loads(output))
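The two files above can be folded into one self-contained sketch for quick experimentation. The inline child script, the function name, and the use of sys.executable (instead of a bare "python") are assumptions for illustration. Note that stderr is kept apart from stdout: merging it with stderr=subprocess.STDOUT, as in the question, would let any warning or traceback text land inside the pickle bytes.

```python
import pickle
import subprocess
import sys

# Stand-in for return_pickle.py, passed inline via -c.
CHILD_SOURCE = r'''
import pickle, sys
sample_obj = {'a': 1, 'b': list(range(100))}
# Write raw bytes to the binary layer underneath sys.stdout.
sys.stdout.buffer.write(pickle.dumps(sample_obj))
'''


def load_from_child():
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,  # kept separate so it cannot pollute the pickle
        check=True,              # raise if the child exits with an error
    )
    return pickle.loads(result.stdout)


print(load_from_child())
```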
Upvotes: 2